[jira] [Commented] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2

2022-02-15 Thread Mehul Batra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493042#comment-17493042
 ] 

Mehul Batra commented on SPARK-38218:
-

When can we expect the next release?

> Looks like the wrong package is available on the spark downloads page. The 
> name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
> -
>
> Key: SPARK-38218
> URL: https://issues.apache.org/jira/browse/SPARK-38218
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Mehul Batra
>Priority: Major
> Attachments: Screenshot_20220214-013156.jpg, 
> image-2022-02-16-12-26-32-871.png
>
>
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> Does the tgz actually contain Hadoop 3.3 and the name is just written wrong, or 
> is it really the Hadoop 3.2 version?
> If it is Hadoop 3.3, does that build come with S3 magic committer support?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2

2022-02-15 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493041#comment-17493041
 ] 

Hyukjin Kwon commented on SPARK-38218:
--

That will be fixed in the next release since we fixed it in the dev branch.

> Looks like the wrong package is available on the spark downloads page. The 
> name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
> -
>
> Key: SPARK-38218
> URL: https://issues.apache.org/jira/browse/SPARK-38218
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Mehul Batra
>Priority: Major
> Attachments: Screenshot_20220214-013156.jpg, 
> image-2022-02-16-12-26-32-871.png
>
>
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> Does the tgz actually contain Hadoop 3.3 and the name is just written wrong, or 
> is it really the Hadoop 3.2 version?
> If it is Hadoop 3.3, does that build come with S3 magic committer support?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38227) Apply strict nullability of nested column in time window / session window

2022-02-15 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-38227:


 Summary: Apply strict nullability of nested column in time window 
/ session window
 Key: SPARK-38227
 URL: https://issues.apache.org/jira/browse/SPARK-38227
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.2.1, 3.3.0
Reporter: Jungtaek Lim


In TimeWindow and SessionWindow, we define the dataType of these function 
expressions as a StructType with two nested columns, "start" and "end", both of 
which are nullable.

We replace these expressions in the analyzer via the corresponding rules: 
TimeWindowing for TimeWindow and SessionWindowing for SessionWindow.

The rules replace the function expressions with an Alias referring to a 
CreateNamedStruct. For the value side of CreateNamedStruct, we don't specify 
anything about nullability, which carries the risk that the value side is 
interpreted (or optimized) as non-nullable, which would differ from what Spark 
expects.

We should make sure the nullability of the columns in CreateNamedStruct stays the 
same as the dataType definition of these function expressions.
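
For illustration, a minimal sketch (assuming an existing SparkSession named spark; 
not part of the fix itself) that inspects the nullability of the nested start/end 
fields the TimeWindowing rewrite produces:

{code:scala}
import org.apache.spark.sql.functions.{col, window}

// Per the dataType definition, both nested fields should stay nullable;
// if the rewrite drops nullability, the printed schema will disagree.
val df = spark.range(1)
  .selectExpr("timestamp'2022-02-16 00:00:00' AS ts")
  .groupBy(window(col("ts"), "10 minutes"))
  .count()

df.schema.printTreeString()
{code}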



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2

2022-02-15 Thread Mehul Batra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493029#comment-17493029
 ] 

Mehul Batra edited comment on SPARK-38218 at 2/16/22, 6:57 AM:
---

Hi [~hyukjin.kwon], so if I download the tgz from the Spark downloads page, will it 
contain Hadoop 3.3.1? It still shows Hadoop 3.2 in that section; attaching a 
screenshot of the same.
!image-2022-02-16-12-26-32-871.png|width=736,height=121!


was (Author: me_bat):
So if I download the tgz from the Spark downloads page, will it contain Hadoop 
3.3.1? It still shows Hadoop 3.2 in that section; attaching a screenshot of the 
same.
!image-2022-02-16-12-26-32-871.png|width=736,height=121!

> Looks like the wrong package is available on the spark downloads page. The 
> name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
> -
>
> Key: SPARK-38218
> URL: https://issues.apache.org/jira/browse/SPARK-38218
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Mehul Batra
>Priority: Major
> Attachments: Screenshot_20220214-013156.jpg, 
> image-2022-02-16-12-26-32-871.png
>
>
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> Does the tgz actually contain Hadoop 3.3 and the name is just written wrong, or 
> is it really the Hadoop 3.2 version?
> If it is Hadoop 3.3, does that build come with S3 magic committer support?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2

2022-02-15 Thread Mehul Batra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493029#comment-17493029
 ] 

Mehul Batra commented on SPARK-38218:
-

So if I download the tgz from the Spark downloads page, will it contain Hadoop 
3.3.1? It still shows Hadoop 3.2 in that section; attaching a screenshot of the 
same.
!image-2022-02-16-12-26-32-871.png|width=736,height=121!

> Looks like the wrong package is available on the spark downloads page. The 
> name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
> -
>
> Key: SPARK-38218
> URL: https://issues.apache.org/jira/browse/SPARK-38218
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Mehul Batra
>Priority: Major
> Attachments: Screenshot_20220214-013156.jpg, 
> image-2022-02-16-12-26-32-871.png
>
>
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> Does the tgz actually contain Hadoop 3.3 and the name is just written wrong, or 
> is it really the Hadoop 3.2 version?
> If it is Hadoop 3.3, does that build come with S3 magic committer support?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2

2022-02-15 Thread Mehul Batra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mehul Batra updated SPARK-38218:

Attachment: image-2022-02-16-12-26-32-871.png

> Looks like the wrong package is available on the spark downloads page. The 
> name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
> -
>
> Key: SPARK-38218
> URL: https://issues.apache.org/jira/browse/SPARK-38218
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Mehul Batra
>Priority: Major
> Attachments: Screenshot_20220214-013156.jpg, 
> image-2022-02-16-12-26-32-871.png
>
>
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> Does the tgz actually contain Hadoop 3.3 and the name is just written wrong, or 
> is it really the Hadoop 3.2 version?
> If it is Hadoop 3.3, does that build come with S3 magic committer support?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493007#comment-17493007
 ] 

Apache Spark commented on SPARK-38226:
--

User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35538

> Fix HiveCompatibilitySuite under ANSI mode
> --
>
> Key: SPARK-38226
> URL: https://issues.apache.org/jira/browse/SPARK-38226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>
> Fix 
> sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
>  under ANSI mode.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38226:


Assignee: (was: Apache Spark)

> Fix HiveCompatibilitySuite under ANSI mode
> --
>
> Key: SPARK-38226
> URL: https://issues.apache.org/jira/browse/SPARK-38226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>
> Fix 
> sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
>  under ANSI mode.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38226:


Assignee: Apache Spark

> Fix HiveCompatibilitySuite under ANSI mode
> --
>
> Key: SPARK-38226
> URL: https://issues.apache.org/jira/browse/SPARK-38226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Apache Spark
>Priority: Major
>
> Fix 
> sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
>  under ANSI mode.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493005#comment-17493005
 ] 

Apache Spark commented on SPARK-38226:
--

User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35538

> Fix HiveCompatibilitySuite under ANSI mode
> --
>
> Key: SPARK-38226
> URL: https://issues.apache.org/jira/browse/SPARK-38226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>
> Fix 
> sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
>  under ANSI mode.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode

2022-02-15 Thread Xinyi Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyi Yu updated SPARK-38226:
-
Description: Fix 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 under ANSI mode.
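
A minimal sketch of what "ANSI mode" refers to here (assuming an existing 
SparkSession named spark; spark.sql.ansi.enabled is the standard flag):

{code:scala}
// ANSI mode is off by default; the suite needs to pass with it enabled.
// Under ANSI mode, operations such as invalid casts or integer overflow raise
// errors instead of silently returning NULL or wrapping around.
spark.conf.set("spark.sql.ansi.enabled", "true")
{code}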

> Fix HiveCompatibilitySuite under ANSI mode
> --
>
> Key: SPARK-38226
> URL: https://issues.apache.org/jira/browse/SPARK-38226
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>
> Fix 
> sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
>  under ANSI mode.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38226) Fix HiveCompatibilitySuite under ANSI mode

2022-02-15 Thread Xinyi Yu (Jira)
Xinyi Yu created SPARK-38226:


 Summary: Fix HiveCompatibilitySuite under ANSI mode
 Key: SPARK-38226
 URL: https://issues.apache.org/jira/browse/SPARK-38226
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Xinyi Yu






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38173) Quoted column cannot be recognized correctly when quotedRegexColumnNames is true

2022-02-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38173:
---

Assignee: Tongwei

> Quoted column cannot be recognized correctly when quotedRegexColumnNames is 
> true
> 
>
> Key: SPARK-38173
> URL: https://issues.apache.org/jira/browse/SPARK-38173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Tongwei
>Assignee: Tongwei
>Priority: Major
> Fix For: 3.3.0
>
>
> When spark.sql.parser.quotedRegexColumnNames=true
> {code:java}
>  SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;{code}
> The above query will throw an exception
> {code:java}
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 
> 'multiply'
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:370)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:266)
>         at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:44)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:266)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:261)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:275)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in 
> expression 'multiply'
>         at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
>         at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:155)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1700)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1671)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342)
>         at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:339)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:339)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1671)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1656)
>  {code}
> It works fine in Hive:
> {code:java}
> 0: jdbc:hive2://hiveserver-inc.> set hive.support.quoted.identifiers=none;

[jira] [Resolved] (SPARK-38173) Quoted column cannot be recognized correctly when quotedRegexColumnNames is true

2022-02-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38173.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35476
[https://github.com/apache/spark/pull/35476]

> Quoted column cannot be recognized correctly when quotedRegexColumnNames is 
> true
> 
>
> Key: SPARK-38173
> URL: https://issues.apache.org/jira/browse/SPARK-38173
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Tongwei
>Priority: Major
> Fix For: 3.3.0
>
>
> When spark.sql.parser.quotedRegexColumnNames=true
> {code:java}
>  SELECT `(C3)?+.+`,`C1` * C2 FROM (SELECT 3 AS C1,2 AS C2,1 AS C3) T;{code}
> The above query will throw an exception
> {code:java}
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 
> 'multiply'
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:370)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:266)
>         at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:44)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:266)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:261)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:275)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.sql.AnalysisException: Invalid usage of '*' in 
> expression 'multiply'
>         at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
>         at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:49)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:155)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1700)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1671)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:342)
>         at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:342)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:339)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:408)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:244)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:406)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:359)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:339)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:1671)
>         at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:1656)
>  {code}
> It works fine in Hive:
> {code:java}
> 

[jira] [Resolved] (SPARK-38201) Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and `overwrite`

2022-02-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38201.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35509
[https://github.com/apache/spark/pull/35509]

> Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and 
> `overwrite`
> -
>
> Key: SPARK-38201
> URL: https://issues.apache.org/jira/browse/SPARK-38201
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
> Fix For: 3.3.0
>
>
> KubernetesUtils#uploadFileToHadoopCompatibleFS defines the input parameters 
> `delSrc` and `overwrite`, but the constants false and true are used when calling 
> the `FileSystem.copyFromLocalFile(boolean delSrc, boolean overwrite, Path src, 
> Path dst)` method.
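
A hedged sketch of the intended behavior (simplified signature; the real helper 
takes more arguments): forward the caller's flags instead of hard-coding them.

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Previously the call site passed literal false/true regardless of the
// delSrc/overwrite values the caller supplied.
def uploadFileToHadoopCompatibleFS(
    src: Path,
    dest: Path,
    fs: FileSystem,
    delSrc: Boolean,
    overwrite: Boolean): Unit = {
  fs.copyFromLocalFile(delSrc, overwrite, src, dest)
}
{code}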



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38201) Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and `overwrite`

2022-02-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38201:
-

Assignee: Yang Jie

> Fix KubernetesUtils#uploadFileToHadoopCompatibleFS use passed in `delSrc` and 
> `overwrite`
> -
>
> Key: SPARK-38201
> URL: https://issues.apache.org/jira/browse/SPARK-38201
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
>
> KubernetesUtils#uploadFileToHadoopCompatibleFS defines the input parameters 
> `delSrc` and `overwrite`, but the constants false and true are used when calling 
> the `FileSystem.copyFromLocalFile(boolean delSrc, boolean overwrite, Path src, 
> Path dst)` method.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38225) Complete input validation of function to_binary

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38225:


Assignee: (was: Apache Spark)

> Complete input validation of function to_binary
> ---
>
> Key: SPARK-38225
> URL: https://issues.apache.org/jira/browse/SPARK-38225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, the function to_binary doesn't handle a non-string {{format}} 
> parameter properly.
> For example, {{spark.sql("select to_binary('abc', 1)")}} raises a casting 
> error rather than hinting that the encoding format is unsupported.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38225) Complete input validation of function to_binary

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38225:


Assignee: Apache Spark

> Complete input validation of function to_binary
> ---
>
> Key: SPARK-38225
> URL: https://issues.apache.org/jira/browse/SPARK-38225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the function to_binary doesn't handle a non-string {{format}} 
> parameter properly.
> For example, {{spark.sql("select to_binary('abc', 1)")}} raises a casting 
> error rather than hinting that the encoding format is unsupported.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38225) Complete input validation of function to_binary

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492985#comment-17492985
 ] 

Apache Spark commented on SPARK-38225:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/35533

> Complete input validation of function to_binary
> ---
>
> Key: SPARK-38225
> URL: https://issues.apache.org/jira/browse/SPARK-38225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, the function to_binary doesn't handle a non-string {{format}} 
> parameter properly.
> For example, {{spark.sql("select to_binary('abc', 1)")}} raises a casting 
> error rather than hinting that the encoding format is unsupported.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38225) Complete input validation of function to_binary

2022-02-15 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492984#comment-17492984
 ] 

Xinrong Meng commented on SPARK-38225:
--

I am working on that.

> Complete input validation of function to_binary
> ---
>
> Key: SPARK-38225
> URL: https://issues.apache.org/jira/browse/SPARK-38225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, the function to_binary doesn't handle a non-string {{format}} 
> parameter properly.
> For example, {{spark.sql("select to_binary('abc', 1)")}} raises a casting 
> error rather than hinting that the encoding format is unsupported.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38225) Complete input validation of function to_binary

2022-02-15 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38225:


 Summary: Complete input validation of function to_binary
 Key: SPARK-38225
 URL: https://issues.apache.org/jira/browse/SPARK-38225
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Currently, the function to_binary doesn't handle a non-string {{format}} 
parameter properly.

For example, {{spark.sql("select to_binary('abc', 1)")}} raises a casting error 
rather than hinting that the encoding format is unsupported.
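
For comparison, illustrative calls assuming an existing SparkSession named spark 
(format names per the SQL function docs): valid string formats work, while a 
non-string format such as the literal 1 above should be rejected with a clear 
"unsupported format" message rather than a cast error.

{code:scala}
spark.sql("SELECT to_binary('537061726b', 'hex')").show()   // hex-encoded "Spark"
spark.sql("SELECT to_binary('U3Bhcms=', 'base64')").show()  // base64-encoded "Spark"
spark.sql("SELECT to_binary('abc', 'utf-8')").show()
{code}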



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38224) How do I get a lot of results in KDE

2022-02-15 Thread Ben Wan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Wan updated SPARK-38224:

Priority: Trivial  (was: Major)

> How do I get a lot of results in KDE
> 
>
> Key: SPARK-38224
> URL: https://issues.apache.org/jira/browse/SPARK-38224
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.4.5
>Reporter: Ben Wan
>Priority: Trivial
>
> I have a pyspark.DataFrame and have converted one of its columns to an RDD and 
> performed KDE on it. I need to get the KDE estimates for all values of that 
> column and add them as a new column in the DataFrame for subsequent work. How 
> can I do this with Spark?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38224) How do I get a lot of results in KDE

2022-02-15 Thread Ben Wan (Jira)
Ben Wan created SPARK-38224:
---

 Summary: How do I get a lot of results in KDE
 Key: SPARK-38224
 URL: https://issues.apache.org/jira/browse/SPARK-38224
 Project: Spark
  Issue Type: Question
  Components: ML
Affects Versions: 2.4.5
Reporter: Ben Wan


I have a pyspark.DataFrame and have converted one of its columns to an RDD and 
performed KDE on it. I need to get the KDE estimates for all values of that column 
and add them as a new column in the DataFrame for subsequent work. How can I do 
this with Spark?
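
A minimal Scala sketch of the workflow described above (the DataFrame df, the 
column name "value", the bandwidth, and the evaluation points are hypothetical):

{code:scala}
import org.apache.spark.mllib.stat.KernelDensity

// Pull one column out as an RDD[Double] and fit a kernel density estimator.
val sample = df.select("value").rdd.map(_.getDouble(0))
val kde = new KernelDensity().setSample(sample).setBandwidth(3.0)

// Density estimates at the chosen evaluation points.
val densities: Array[Double] = kde.estimate(Array(1.0, 2.0, 3.0))
{code}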



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38223) PersistentVolumeClaim does not work in clusters with multiple nodes

2022-02-15 Thread Zimo Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492960#comment-17492960
 ] 

Zimo Li commented on SPARK-38223:
-

[~hyukjin.kwon] Sorry I missed that. I just added a title.

> PersistentVolumeClaim does not work in clusters with multiple nodes
> ---
>
> Key: SPARK-38223
> URL: https://issues.apache.org/jira/browse/SPARK-38223
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.1
> Environment: 
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#how-it-works]
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]
> [https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes]
>  
>Reporter: Zimo Li
>Priority: Minor
>
> We are using {{spark-submit}} to establish a ThriftServer warehouse on Google 
> Kubernetes Engine. The Spark documentation on running on Kubernetes suggests 
> that we can use 
> [persistentVolumeClaim|https://kubernetes.io/docs/concepts/storage/volumes/#persistentvolumeclaim]
>  for Spark applications.
> {code:bash}
> spark-submit \
>   --master k8s://$KUBERNETES_SERVICE_HOST \
>   --deploy-mode cluster \
>   --class $THRIFTSERVER \
>   --conf spark.sql.catalogImplementation=hive \
>   --conf spark.sql.hive.metastore.sharedPrefixes=org.postgresql \
>   --conf spark.hadoop.hive.metastore.schema.verification=false \
>   --conf spark.hadoop.datanucleus.schema.autoCreateTables=true \
>   --conf spark.hadoop.datanucleus.autoCreateSchema=false \
>   --conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED \
>   --conf 
> spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver \
>   --conf spark.hadoop.javax.jdo.option.ConnectionUserName=spark \
>   --conf spark.hadoop.javax.jdo.option.ConnectionPassword=Password1! \
>   --conf spark.sql.warehouse.dir=$MOUNT_PATH \
>   --conf spark.kubernetes.driver.pod.name=spark-hive-thriftserver-driver \
>   --conf spark.kubernetes.driver.label.app.kubernetes.io/name=thriftserver \
>   --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME
>  \
>   --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH
>  \
>   --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false
>  \
>   --conf 
> spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME
>  \
>   --conf 
> spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH
>  \
>   --conf 
> spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false
>  \
>   --conf spark.kubernetes.executor.deleteOnTermination=true \
>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-kube \
>   --conf spark.kubernetes.container.image=$IMAGE \
>   --conf spark.kubernetes.container.image.pullPolicy=Always \
>   --conf spark.executor.memory=2g \
>   --conf spark.driver.memory=2g \
>   local:///$JAR {code}
> When it ran, it created one driver and two executors. Each of these wanted to 
> use the same pvc. Unfortunately, at least one of these pods was scheduled on 
> a different node from the rest. As GKE mounts pvs to nodes in order to honor 
> pvcs for pods, that odd pod out was unable to attach the pv:
> {code:java}
> FailedMount
> Unable to attach or mount volumes: unmounted volumes=[spark-warehouse], 
> unattached volumes=[kube-api-access-grfld spark-conf-volume-exec 
> spark-warehouse spark-local-dir-1]: timed out waiting for the condition {code}
> This is because GKE, like many cloud providers, does not support 
> {{ReadWriteMany}} for pvcs/pvs.
> 
> I suggest changing the documentation not to suggest using pvcs for 
> ThriftServers.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38223) PersistentVolumeClaim does not work in clusters with multiple nodes

2022-02-15 Thread Zimo Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zimo Li updated SPARK-38223:

Summary: PersistentVolumeClaim does not work in clusters with multiple 
nodes  (was: Spark)

> PersistentVolumeClaim does not work in clusters with multiple nodes
> ---
>
> Key: SPARK-38223
> URL: https://issues.apache.org/jira/browse/SPARK-38223
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.1
> Environment: 
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#how-it-works]
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]
> [https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes]
>  
>Reporter: Zimo Li
>Priority: Minor
>
> We are using {{spark-submit}} to establish a ThriftServer warehouse on Google 
> Kubernetes Engine. The Spark documentation on running on Kubernetes suggests 
> that we can use 
> [persistentVolumeClaim|https://kubernetes.io/docs/concepts/storage/volumes/#persistentvolumeclaim]
>  for Spark applications.
> {code:bash}
> spark-submit \
>   --master k8s://$KUBERNETES_SERVICE_HOST \
>   --deploy-mode cluster \
>   --class $THRIFTSERVER \
>   --conf spark.sql.catalogImplementation=hive \
>   --conf spark.sql.hive.metastore.sharedPrefixes=org.postgresql \
>   --conf spark.hadoop.hive.metastore.schema.verification=false \
>   --conf spark.hadoop.datanucleus.schema.autoCreateTables=true \
>   --conf spark.hadoop.datanucleus.autoCreateSchema=false \
>   --conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED \
>   --conf 
> spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver \
>   --conf spark.hadoop.javax.jdo.option.ConnectionUserName=spark \
>   --conf spark.hadoop.javax.jdo.option.ConnectionPassword=Password1! \
>   --conf spark.sql.warehouse.dir=$MOUNT_PATH \
>   --conf spark.kubernetes.driver.pod.name=spark-hive-thriftserver-driver \
>   --conf spark.kubernetes.driver.label.app.kubernetes.io/name=thriftserver \
>   --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME
>  \
>   --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH
>  \
>   --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false
>  \
>   --conf 
> spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME
>  \
>   --conf 
> spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH
>  \
>   --conf 
> spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false
>  \
>   --conf spark.kubernetes.executor.deleteOnTermination=true \
>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-kube \
>   --conf spark.kubernetes.container.image=$IMAGE \
>   --conf spark.kubernetes.container.image.pullPolicy=Always \
>   --conf spark.executor.memory=2g \
>   --conf spark.driver.memory=2g \
>   local:///$JAR {code}
> When it ran, it created one driver and two executors. Each of these wanted to 
> use the same pvc. Unfortunately, at least one of these pods was scheduled on 
> a different node from the rest. As GKE mounts pvs to nodes in order to honor 
> pvcs for pods, that odd pod out was unable to attach the pv:
> {code:java}
> FailedMount
> Unable to attach or mount volumes: unmounted volumes=[spark-warehouse], 
> unattached volumes=[kube-api-access-grfld spark-conf-volume-exec 
> spark-warehouse spark-local-dir-1]: timed out waiting for the condition {code}
> This is because GKE, like many cloud providers, does not support 
> {{ReadWriteMany}} for pvcs/pvs.
> 
> I suggest changing the documentation not to suggest using pvcs for 
> ThriftServers.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38221) Group by a stream of complex expressions fails

2022-02-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38221.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/35537

> Group by a stream of complex expressions fails
> --
>
> Key: SPARK-38221
> URL: https://issues.apache.org/jira/browse/SPARK-38221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> This query fails:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in 
> [id#4,_groupingexpression#23]
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:534)
>   at scala.collection.TraversableOnce.count(TraversableOnce.scala:152)
>   at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145)
>   at scala.collection.AbstractTraversable.count(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623)
> {noformat}
> However, replace {{Stream}} with {{Seq}} and it works:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> +++---+
> |(id + 1)|(id + 2)|sum(id)|
> +++---+
> |2   |3   |1  |
> +++---+
> scala> 
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38221) Group by a stream of complex expressions fails

2022-02-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38221:
-
Fix Version/s: 3.3.0
   3.2.2

> Group by a stream of complex expressions fails
> --
>
> Key: SPARK-38221
> URL: https://issues.apache.org/jira/browse/SPARK-38221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> This query fails:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in 
> [id#4,_groupingexpression#23]
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:534)
>   at scala.collection.TraversableOnce.count(TraversableOnce.scala:152)
>   at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145)
>   at scala.collection.AbstractTraversable.count(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623)
> {noformat}
> However, replace {{Stream}} with {{Seq}} and it works:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> +++---+
> |(id + 1)|(id + 2)|sum(id)|
> +++---+
> |2   |3   |1  |
> +++---+
> scala> 
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2

2022-02-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38218.
--
Resolution: Duplicate

Probably it's a duplicate of SPARK-37445.

> Looks like the wrong package is available on the spark downloads page. The 
> name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
> -
>
> Key: SPARK-38218
> URL: https://issues.apache.org/jira/browse/SPARK-38218
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Mehul Batra
>Priority: Major
> Attachments: Screenshot_20220214-013156.jpg
>
>
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> Does the tgz actually contain Hadoop 3.3 and the name is just written wrong, or 
> is it really the Hadoop 3.2 version?
> If it is Hadoop 3.3, does that build come with S3 magic committer support?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2

2022-02-15 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492952#comment-17492952
 ] 

Hyukjin Kwon commented on SPARK-38218:
--

[~ME_BAT] the images are broken. Mind checking this, please?

> Looks like the wrong package is available on the spark downloads page. The 
> name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
> -
>
> Key: SPARK-38218
> URL: https://issues.apache.org/jira/browse/SPARK-38218
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Mehul Batra
>Priority: Major
> Attachments: Screenshot_20220214-013156.jpg
>
>
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> Does the tgz actually contain Hadoop 3.3 and the name is just written wrong, or 
> is it really the Hadoop 3.2 version?
> If it is Hadoop 3.3, does that build come with S3 magic committer support?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38223) Spark

2022-02-15 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492951#comment-17492951
 ] 

Hyukjin Kwon commented on SPARK-38223:
--

[~lizimo] mind fixing the JIRA title to summarize the issue?

> Spark
> -
>
> Key: SPARK-38223
> URL: https://issues.apache.org/jira/browse/SPARK-38223
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.1
> Environment: 
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#how-it-works]
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]
> [https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes]
>  
>Reporter: Zimo Li
>Priority: Minor
>
> We are using {{spark-submit}} to establish a ThriftServer warehouse on Google 
> Kubernetes Engine. The Spark documentation on running on Kubernetes suggests 
> that we can use 
> [persistentVolumeClaim|https://kubernetes.io/docs/concepts/storage/volumes/#persistentvolumeclaim]
>  for Spark applications.
> {code:bash}
> spark-submit \
>   --master k8s://$KUBERNETES_SERVICE_HOST \
>   --deploy-mode cluster \
>   --class $THRIFTSERVER \
>   --conf spark.sql.catalogImplementation=hive \
>   --conf spark.sql.hive.metastore.sharedPrefixes=org.postgresql \
>   --conf spark.hadoop.hive.metastore.schema.verification=false \
>   --conf spark.hadoop.datanucleus.schema.autoCreateTables=true \
>   --conf spark.hadoop.datanucleus.autoCreateSchema=false \
>   --conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED \
>   --conf 
> spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver \
>   --conf spark.hadoop.javax.jdo.option.ConnectionUserName=spark \
>   --conf spark.hadoop.javax.jdo.option.ConnectionPassword=Password1! \
>   --conf spark.sql.warehouse.dir=$MOUNT_PATH \
>   --conf spark.kubernetes.driver.pod.name=spark-hive-thriftserver-driver \
>   --conf spark.kubernetes.driver.label.app.kubernetes.io/name=thriftserver \
>   --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME
>  \
>   --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH
>  \
>   --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false
>  \
>   --conf 
> spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME
>  \
>   --conf 
> spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH
>  \
>   --conf 
> spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false
>  \
>   --conf spark.kubernetes.executor.deleteOnTermination=true \
>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-kube \
>   --conf spark.kubernetes.container.image=$IMAGE \
>   --conf spark.kubernetes.container.image.pullPolicy=Always \
>   --conf spark.executor.memory=2g \
>   --conf spark.driver.memory=2g \
>   local:///$JAR {code}
> When it ran, it created one driver and two executors. Each of these wanted to 
> use the same pvc. Unfortunately, at least one of these pods was scheduled on 
> a different node from the rest. As GKE mounts pvs to nodes in order to honor 
> pvcs for pods, that odd pod out was unable to attach the pv:
> {code:java}
> FailedMount
> Unable to attach or mount volumes: unmounted volumes=[spark-warehouse], 
> unattached volumes=[kube-api-access-grfld spark-conf-volume-exec 
> spark-warehouse spark-local-dir-1]: timed out waiting for the condition {code}
> This is because GKE, like many cloud providers, does not support 
> {{ReadWriteMany}} for pvcs/pvs.
> 
> I suggest changing the documentation not to suggest using pvcs for 
> ThriftServers.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38221) Group by a stream of complex expressions fails

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38221:


Assignee: Apache Spark

> Group by a stream of complex expressions fails
> --
>
> Key: SPARK-38221
> URL: https://issues.apache.org/jira/browse/SPARK-38221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> This query fails:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in 
> [id#4,_groupingexpression#23]
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:534)
>   at scala.collection.TraversableOnce.count(TraversableOnce.scala:152)
>   at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145)
>   at scala.collection.AbstractTraversable.count(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623)
> {noformat}
> However, replace {{Stream}} with {{Seq}} and it works:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> +++---+
> |(id + 1)|(id + 2)|sum(id)|
> +++---+
> |2   |3   |1  |
> +++---+
> scala> 
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38221) Group by a stream of complex expressions fails

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492944#comment-17492944
 ] 

Apache Spark commented on SPARK-38221:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/35537

> Group by a stream of complex expressions fails
> --
>
> Key: SPARK-38221
> URL: https://issues.apache.org/jira/browse/SPARK-38221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This query fails:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in 
> [id#4,_groupingexpression#23]
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:534)
>   at scala.collection.TraversableOnce.count(TraversableOnce.scala:152)
>   at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145)
>   at scala.collection.AbstractTraversable.count(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623)
> {noformat}
> However, replace {{Stream}} with {{Seq}} and it works:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> +++---+
> |(id + 1)|(id + 2)|sum(id)|
> +++---+
> |2   |3   |1  |
> +++---+
> scala> 
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38221) Group by a stream of complex expressions fails

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38221:


Assignee: (was: Apache Spark)

> Group by a stream of complex expressions fails
> --
>
> Key: SPARK-38221
> URL: https://issues.apache.org/jira/browse/SPARK-38221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This query fails:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in 
> [id#4,_groupingexpression#23]
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:534)
>   at scala.collection.TraversableOnce.count(TraversableOnce.scala:152)
>   at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145)
>   at scala.collection.AbstractTraversable.count(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623)
> {noformat}
> However, replace {{Stream}} with {{Seq}} and it works:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> +++---+
> |(id + 1)|(id + 2)|sum(id)|
> +++---+
> |2   |3   |1  |
> +++---+
> scala> 
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38222) Expose Node Description attribute in SQL Rest API

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38222:


Assignee: (was: Apache Spark)

> Expose Node Description attribute in SQL Rest API
> -
>
> Key: SPARK-38222
> URL: https://issues.apache.org/jira/browse/SPARK-38222
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> Currently, the SQL public REST API does not expose the node description. It 
> would be useful to have a nodeDesc attribute at the query level to provide 
> more details, such as:
> {code:java}
> - Join Operators(BHJ, SMJ, SHJ) => when correlating join operator with join 
> type and which leg is built for BHJ. 
> - HashAggregate => aggregated keys and agg functions
> - List can be extended for other physical operators.{code}
> *Current Sample Json Result:*
> {code:java}
> {
>     "nodeId" : 14,
>     "nodeName" : "BroadcastHashJoin",
>     "wholeStageCodegenId" : 3,
>     "stageIds" : [ 5 ],
>     "metrics" : [ {
>           "name" : "number of output rows",
>           "value" : {
>         "amount" : "2"
>           }
>     }
> },
> ...
> {
>     "nodeId" : 8,
>     "nodeName" : "HashAggregate",
>     "wholeStageCodegenId" : 4,
>     "stageIds" : [ 8 ],
>     "metrics" : [ {
>       "name" : "spill size",
>       "value" : {
>         "amount" : "0.0"
>       }
>     }
> } {code}
> *New Sample Json Result:*
> {code:java}
> {
>     "nodeId" : 14,
>     "nodeName" : "BroadcastHashJoin",
>     "nodeDesc" : "BroadcastHashJoin [id#4], [id#24], Inner, BuildLeft, false",
>     "wholeStageCodegenId" : 3,
>     "stageIds" : [ 5 ],
>     "metrics" : [ {
>           "name" : "number of output rows",
>           "value" : {
>         "amount" : "2"
>           }
>     }
> },
> ...
> {
>     "nodeId" : 8,
>     "nodeName" : "HashAggregate",
>     "nodeDesc" : "HashAggregate(keys=[name#5, age#6, salary#18], 
> functions=[avg(cast(age#6 as bigint)), avg(salary#18)])",
>     "wholeStageCodegenId" : 4,
>     "stageIds" : [ 8 ],
>     "metrics" : [ {
>       "name" : "spill size",
>       "value" : {
>         "amount" : "0.0"
>       }
>     }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38222) Expose Node Description attribute in SQL Rest API

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492927#comment-17492927
 ] 

Apache Spark commented on SPARK-38222:
--

User 'erenavsarogullari' has created a pull request for this issue:
https://github.com/apache/spark/pull/35536

> Expose Node Description attribute in SQL Rest API
> -
>
> Key: SPARK-38222
> URL: https://issues.apache.org/jira/browse/SPARK-38222
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> Currently, the SQL public REST API does not expose the node description. It 
> would be useful to have a nodeDesc attribute at the query level to provide 
> more details, such as:
> {code:java}
> - Join Operators(BHJ, SMJ, SHJ) => when correlating join operator with join 
> type and which leg is built for BHJ. 
> - HashAggregate => aggregated keys and agg functions
> - List can be extended for other physical operators.{code}
> *Current Sample Json Result:*
> {code:java}
> {
>     "nodeId" : 14,
>     "nodeName" : "BroadcastHashJoin",
>     "wholeStageCodegenId" : 3,
>     "stageIds" : [ 5 ],
>     "metrics" : [ {
>           "name" : "number of output rows",
>           "value" : {
>         "amount" : "2"
>           }
>     }
> },
> ...
> {
>     "nodeId" : 8,
>     "nodeName" : "HashAggregate",
>     "wholeStageCodegenId" : 4,
>     "stageIds" : [ 8 ],
>     "metrics" : [ {
>       "name" : "spill size",
>       "value" : {
>         "amount" : "0.0"
>       }
>     }
> } {code}
> *New Sample Json Result:*
> {code:java}
> {
>     "nodeId" : 14,
>     "nodeName" : "BroadcastHashJoin",
>     "nodeDesc" : "BroadcastHashJoin [id#4], [id#24], Inner, BuildLeft, false",
>     "wholeStageCodegenId" : 3,
>     "stageIds" : [ 5 ],
>     "metrics" : [ {
>           "name" : "number of output rows",
>           "value" : {
>         "amount" : "2"
>           }
>     }
> },
> ...
> {
>     "nodeId" : 8,
>     "nodeName" : "HashAggregate",
>     "nodeDesc" : "HashAggregate(keys=[name#5, age#6, salary#18], 
> functions=[avg(cast(age#6 as bigint)), avg(salary#18)])",
>     "wholeStageCodegenId" : 4,
>     "stageIds" : [ 8 ],
>     "metrics" : [ {
>       "name" : "spill size",
>       "value" : {
>         "amount" : "0.0"
>       }
>     }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38222) Expose Node Description attribute in SQL Rest API

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38222:


Assignee: Apache Spark

> Expose Node Description attribute in SQL Rest API
> -
>
> Key: SPARK-38222
> URL: https://issues.apache.org/jira/browse/SPARK-38222
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Eren Avsarogullari
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the SQL public REST API does not expose the node description. It 
> would be useful to have a nodeDesc attribute at the query level to provide 
> more details, such as:
> {code:java}
> - Join Operators(BHJ, SMJ, SHJ) => when correlating join operator with join 
> type and which leg is built for BHJ. 
> - HashAggregate => aggregated keys and agg functions
> - List can be extended for other physical operators.{code}
> *Current Sample Json Result:*
> {code:java}
> {
>     "nodeId" : 14,
>     "nodeName" : "BroadcastHashJoin",
>     "wholeStageCodegenId" : 3,
>     "stageIds" : [ 5 ],
>     "metrics" : [ {
>           "name" : "number of output rows",
>           "value" : {
>         "amount" : "2"
>           }
>     }
> },
> ...
> {
>     "nodeId" : 8,
>     "nodeName" : "HashAggregate",
>     "wholeStageCodegenId" : 4,
>     "stageIds" : [ 8 ],
>     "metrics" : [ {
>       "name" : "spill size",
>       "value" : {
>         "amount" : "0.0"
>       }
>     }
> } {code}
> *New Sample Json Result:*
> {code:java}
> {
>     "nodeId" : 14,
>     "nodeName" : "BroadcastHashJoin",
>     "nodeDesc" : "BroadcastHashJoin [id#4], [id#24], Inner, BuildLeft, false",
>     "wholeStageCodegenId" : 3,
>     "stageIds" : [ 5 ],
>     "metrics" : [ {
>           "name" : "number of output rows",
>           "value" : {
>         "amount" : "2"
>           }
>     }
> },
> ...
> {
>     "nodeId" : 8,
>     "nodeName" : "HashAggregate",
>     "nodeDesc" : "HashAggregate(keys=[name#5, age#6, salary#18], 
> functions=[avg(cast(age#6 as bigint)), avg(salary#18)])",
>     "wholeStageCodegenId" : 4,
>     "stageIds" : [ 8 ],
>     "metrics" : [ {
>       "name" : "spill size",
>       "value" : {
>         "amount" : "0.0"
>       }
>     }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38220) Upgrade `commons-math3` to 3.6.1

2022-02-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38220.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35535
[https://github.com/apache/spark/pull/35535]

> Upgrade `commons-math3` to 3.6.1
> 
>
> Key: SPARK-38220
> URL: https://issues.apache.org/jira/browse/SPARK-38220
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, MLlib
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38220) Upgrade `commons-math3` to 3.6.1

2022-02-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38220:


Assignee: Dongjoon Hyun

> Upgrade `commons-math3` to 3.6.1
> 
>
> Key: SPARK-38220
> URL: https://issues.apache.org/jira/browse/SPARK-38220
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, MLlib
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38223) Spark

2022-02-15 Thread Zimo Li (Jira)
Zimo Li created SPARK-38223:
---

 Summary: Spark
 Key: SPARK-38223
 URL: https://issues.apache.org/jira/browse/SPARK-38223
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.2.1
 Environment: 
[https://spark.apache.org/docs/latest/running-on-kubernetes.html#how-it-works]

[https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]

[https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes]

 
Reporter: Zimo Li


We are using {{spark-submit}} to establish a ThriftServer warehouse on Google 
Kubernetes Engine. The Spark documentation on running on Kubernetes suggests 
that we can use 
[persistentVolumeClaim|https://kubernetes.io/docs/concepts/storage/volumes/#persistentvolumeclaim]
 for Spark applications.
{code:bash}
spark-submit \
  --master k8s://$KUBERNETES_SERVICE_HOST \
  --deploy-mode cluster \
  --class $THRIFTSERVER \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.hive.metastore.sharedPrefixes=org.postgresql \
  --conf spark.hadoop.hive.metastore.schema.verification=false \
  --conf spark.hadoop.datanucleus.schema.autoCreateTables=true \
  --conf spark.hadoop.datanucleus.autoCreateSchema=false \
  --conf spark.sql.parquet.int96RebaseModeInWrite=CORRECTED \
  --conf 
spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver \
  --conf spark.hadoop.javax.jdo.option.ConnectionUserName=spark \
  --conf spark.hadoop.javax.jdo.option.ConnectionPassword=Password1! \
  --conf spark.sql.warehouse.dir=$MOUNT_PATH \
  --conf spark.kubernetes.driver.pod.name=spark-hive-thriftserver-driver \
  --conf spark.kubernetes.driver.label.app.kubernetes.io/name=thriftserver \
  --conf 
spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME
 \
  --conf 
spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH
 \
  --conf 
spark.kubernetes.executor.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false
 \
  --conf 
spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.options.claimName=$CLAIM_NAME
 \
  --conf 
spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.path=$MOUNT_PATH
 \
  --conf 
spark.kubernetes.driver.volumes.persistentVolumeClaim.$VOLUME_NAME.mount.readOnly=false
 \
  --conf spark.kubernetes.executor.deleteOnTermination=true \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-kube \
  --conf spark.kubernetes.container.image=$IMAGE \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.executor.memory=2g \
  --conf spark.driver.memory=2g \
  local:///$JAR {code}
When it ran, it created one driver and two executors. Each of these wanted to 
use the same pvc. Unfortunately, at least one of these pods was scheduled on a 
different node from the rest. As GKE mounts pvs to nodes in order to honor pvcs 
for pods, that odd pod out was unable to attach the pv:
{code:java}
FailedMount
Unable to attach or mount volumes: unmounted volumes=[spark-warehouse], 
unattached volumes=[kube-api-access-grfld spark-conf-volume-exec 
spark-warehouse spark-local-dir-1]: timed out waiting for the condition {code}
This is because GKE, like many cloud providers, does not support 
{{ReadWriteMany}} for pvcs/pvs.

I suggest changing the documentation so that it no longer recommends using pvcs 
for ThriftServers.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38222) Expose nodeDesc attribute in SQL Rest API

2022-02-15 Thread Eren Avsarogullari (Jira)
Eren Avsarogullari created SPARK-38222:
--

 Summary: Expose nodeDesc attribute in SQL Rest API
 Key: SPARK-38222
 URL: https://issues.apache.org/jira/browse/SPARK-38222
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Eren Avsarogullari


Currently, the SQL public REST API does not expose the node description. It 
would be useful to have a nodeDesc attribute at the query level to provide more 
details, such as:
{code:java}
- Join Operators(BHJ, SMJ, SHJ) => when correlating join operator with join 
type and which leg is built for BHJ. 
- HashAggregate => aggregated keys and agg functions
- List can be extended for other physical operators.{code}
*Current Sample Json Result:*
{code:java}
{
    "nodeId" : 14,
    "nodeName" : "BroadcastHashJoin",
    "wholeStageCodegenId" : 3,
    "stageIds" : [ 5 ],
    "metrics" : [ {
          "name" : "number of output rows",
          "value" : {
        "amount" : "2"
          }
    }
},
...
{
    "nodeId" : 8,
    "nodeName" : "HashAggregate",
    "wholeStageCodegenId" : 4,
    "stageIds" : [ 8 ],
    "metrics" : [ {
      "name" : "spill size",
      "value" : {
        "amount" : "0.0"
      }
    }
} {code}
*New Sample Json Result:*
{code:java}
{
    "nodeId" : 14,
    "nodeName" : "BroadcastHashJoin",
    "nodeDesc" : "BroadcastHashJoin [id#4], [id#24], Inner, BuildLeft, false",
    "wholeStageCodegenId" : 3,
    "stageIds" : [ 5 ],
    "metrics" : [ {
          "name" : "number of output rows",
          "value" : {
        "amount" : "2"
          }
    }
},
...
{
    "nodeId" : 8,
    "nodeName" : "HashAggregate",
    "nodeDesc" : "HashAggregate(keys=[name#5, age#6, salary#18], 
functions=[avg(cast(age#6 as bigint)), avg(salary#18)])",
    "wholeStageCodegenId" : 4,
    "stageIds" : [ 8 ],
    "metrics" : [ {
      "name" : "spill size",
      "value" : {
        "amount" : "0.0"
      }
    }
} {code}
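For context, the JSON above is served by the SQL endpoint of Spark's monitoring 
REST API. A minimal fetch sketch (the UI address, application id, and the 
{{details}} query parameter are illustrative assumptions based on the monitoring 
docs, not values from this ticket):
{code:scala}
import scala.io.Source

// Replace with the real UI address and application id of a running application.
val uiAddress = "http://localhost:4040"
val appId = "app-20220215120000-0000"

// Fetch the per-query SQL metadata; node-level details are what nodeDesc would extend.
val json = Source.fromURL(s"$uiAddress/api/v1/applications/$appId/sql?details=true").mkString
println(json)
{code}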



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38222) Expose Node Description attribute in SQL Rest API

2022-02-15 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-38222:
---
Summary: Expose Node Description attribute in SQL Rest API  (was: Expose 
nodeDesc attribute in SQL Rest API)

> Expose Node Description attribute in SQL Rest API
> -
>
> Key: SPARK-38222
> URL: https://issues.apache.org/jira/browse/SPARK-38222
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Eren Avsarogullari
>Priority: Major
>
> Currently, the SQL public REST API does not expose the node description. It 
> would be useful to have a nodeDesc attribute at the query level to provide 
> more details, such as:
> {code:java}
> - Join Operators(BHJ, SMJ, SHJ) => when correlating join operator with join 
> type and which leg is built for BHJ. 
> - HashAggregate => aggregated keys and agg functions
> - List can be extended for other physical operators.{code}
> *Current Sample Json Result:*
> {code:java}
> {
>     "nodeId" : 14,
>     "nodeName" : "BroadcastHashJoin",
>     "wholeStageCodegenId" : 3,
>     "stageIds" : [ 5 ],
>     "metrics" : [ {
>           "name" : "number of output rows",
>           "value" : {
>         "amount" : "2"
>           }
>     }
> },
> ...
> {
>     "nodeId" : 8,
>     "nodeName" : "HashAggregate",
>     "wholeStageCodegenId" : 4,
>     "stageIds" : [ 8 ],
>     "metrics" : [ {
>       "name" : "spill size",
>       "value" : {
>         "amount" : "0.0"
>       }
>     }
> } {code}
> *New Sample Json Result:*
> {code:java}
> {
>     "nodeId" : 14,
>     "nodeName" : "BroadcastHashJoin",
>     "nodeDesc" : "BroadcastHashJoin [id#4], [id#24], Inner, BuildLeft, false",
>     "wholeStageCodegenId" : 3,
>     "stageIds" : [ 5 ],
>     "metrics" : [ {
>           "name" : "number of output rows",
>           "value" : {
>         "amount" : "2"
>           }
>     }
> },
> ...
> {
>     "nodeId" : 8,
>     "nodeName" : "HashAggregate",
>     "nodeDesc" : "HashAggregate(keys=[name#5, age#6, salary#18], 
> functions=[avg(cast(age#6 as bigint)), avg(salary#18)])",
>     "wholeStageCodegenId" : 4,
>     "stageIds" : [ 8 ],
>     "metrics" : [ {
>       "name" : "spill size",
>       "value" : {
>         "amount" : "0.0"
>       }
>     }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38205) The columns in state schema should be relaxed to be nullable

2022-02-15 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492895#comment-17492895
 ] 

Jungtaek Lim commented on SPARK-38205:
--

I realized the output schema should also be nullable (since the operator will 
produce output from state), and I'm now puzzled about whether there are cases 
where I would break an existing query (a DSv2 sink may check nullability when 
writing).

I guess another option is to never change the nullability in the optimizer and 
keep the nullability check in the state. I would prefer the less invasive 
approach if there is one, since the lifetime of a streaming query is long, 
spanning Spark versions, and compatibility is a major concern.

> The columns in state schema should be relaxed to be nullable
> 
>
> Key: SPARK-38205
> URL: https://issues.apache.org/jira/browse/SPARK-38205
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Starting from SPARK-27237, Spark validates the schema of the state across query 
> runs to make sure it doesn't run into weirder issues, such as a SIGSEGV, at 
> runtime.
> The comparison logic is reasonable in terms of nullability; it uses the 
> following matrix:
> ||existing schema||new schema||allowed||
> |nullable|nullable|O|
> |nullable|non-nullable|O|
> |non-nullable|nullable|X|
> |non-nullable|non-nullable|O|
> What we miss here is that the nullability of a column can be changed by the 
> optimizer (mostly from nullable to non-nullable), and this nullability 
> optimization may be applied differently after even simple changes.
> So this scenario is hypothetically possible:
> 1. At the first run of the query, the optimizer changes some columns from 
> nullable to non-nullable, and that goes into the schema of the state (the 
> state schema has a non-nullable column).
> 2. At the second run of the query (possibly after a code modification or a 
> Spark version upgrade), the optimizer no longer changes such columns from 
> nullable to non-nullable, and the schema comparison of the state (existing vs 
> new) compares non-nullable (existing) vs nullable (new), which is NOT allowed.
> From the storage point of view, the state store does not need to know whether 
> a column is non-nullable or nullable. Interface-wise, the state store has no 
> concept of schema, so it is safe to relax this constraint and let the 
> optimizer do whatever it wants without breaking stateful operators.
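To make the matrix above concrete, here is a minimal sketch of the rule it 
describes (an illustrative helper only, not the actual Spark-internal check):
{code:scala}
import org.apache.spark.sql.types.StructField

// Allowed when the existing state column is nullable, or when both sides are
// non-nullable; rejected only for non-nullable (existing) vs nullable (new).
def nullabilityCompatible(existing: StructField, incoming: StructField): Boolean =
  existing.nullable || !incoming.nullable
{code}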



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38221) Group by a stream of complex expressions fails

2022-02-15 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492869#comment-17492869
 ] 

Bruce Robbins commented on SPARK-38221:
---

I think I have an idea what's going on. I will submit a PR soon.

> Group by a stream of complex expressions fails
> --
>
> Key: SPARK-38221
> URL: https://issues.apache.org/jira/browse/SPARK-38221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This query fails:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in 
> [id#4,_groupingexpression#23]
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
>   at scala.collection.immutable.Stream.foreach(Stream.scala:534)
>   at scala.collection.TraversableOnce.count(TraversableOnce.scala:152)
>   at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145)
>   at scala.collection.AbstractTraversable.count(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623)
> {noformat}
> However, replace {{Stream}} with {{Seq}} and it works:
> {noformat}
> scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): 
> _*).sum("id").show(false)
> +++---+
> |(id + 1)|(id + 2)|sum(id)|
> +++---+
> |2   |3   |1  |
> +++---+
> scala> 
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38221) Group by a stream of complex expressions fails

2022-02-15 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-38221:
-

 Summary: Group by a stream of complex expressions fails
 Key: SPARK-38221
 URL: https://issues.apache.org/jira/browse/SPARK-38221
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.3.0
Reporter: Bruce Robbins


This query fails:
{noformat}
scala> Seq(1).toDF("id").groupBy(Stream($"id" + 1, $"id" + 2): 
_*).sum("id").show(false)
java.lang.IllegalStateException: Couldn't find _groupingexpression#24 in 
[id#4,_groupingexpression#23]
  at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
  at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425)
  at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
  at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:94)
  at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
  at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
  at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
  at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
  at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1173)
  at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1163)
  at scala.collection.immutable.Stream.foreach(Stream.scala:534)
  at scala.collection.TraversableOnce.count(TraversableOnce.scala:152)
  at scala.collection.TraversableOnce.count$(TraversableOnce.scala:145)
  at scala.collection.AbstractTraversable.count(Traversable.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:293)
  at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:623)
{noformat}
However, replace {{Stream}} with {{Seq}} and it works:
{noformat}
scala> Seq(1).toDF("id").groupBy(Seq($"id" + 1, $"id" + 2): 
_*).sum("id").show(false)
+++---+
|(id + 1)|(id + 2)|sum(id)|
+++---+
|2   |3   |1  |
+++---+

scala> 
{noformat}
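A possible workaround until a fix lands, as a sketch of my own (assumes 
spark-shell, so the {{$}} implicits are in scope): materialize the lazy Stream 
into a strict List before passing it to groupBy, which makes it equivalent to 
the working Seq example above.
{code:scala}
// Workaround sketch: force the Stream into a strict collection first, so the
// grouping expressions are not mapped lazily.
val groupCols = Stream($"id" + 1, $"id" + 2).toList
Seq(1).toDF("id").groupBy(groupCols: _*).sum("id").show(false)
{code}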
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-15 Thread kk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492824#comment-17492824
 ] 

kk commented on SPARK-38115:


Is there any config to stop using FileOutputCommitter? We didn't set any conf 
explicitly to choose a committer.

Moreover, when overwriting with an s3:// path I don't see the _temporary 
problem; the problem only appears when the path uses s3a://.

I am just looking for a conf/option that lets me keep the temporary location as 
a staging area while the target path remains the primary output location.

> No spark conf to control the path of _temporary when writing to target 
> filesystem
> -
>
> Key: SPARK-38115
> URL: https://issues.apache.org/jira/browse/SPARK-38115
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.1
>Reporter: kk
>Priority: Minor
>  Labels: spark, spark-conf, spark-sql, spark-submit
>
> No default spark conf or param to control the '_temporary' path when writing 
> to filesystem.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38220) Upgrade `commons-math3` to 3.6.1

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38220:


Assignee: Apache Spark

> Upgrade `commons-math3` to 3.6.1
> 
>
> Key: SPARK-38220
> URL: https://issues.apache.org/jira/browse/SPARK-38220
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, MLlib
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38220) Upgrade `commons-math3` to 3.6.1

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38220:


Assignee: (was: Apache Spark)

> Upgrade `commons-math3` to 3.6.1
> 
>
> Key: SPARK-38220
> URL: https://issues.apache.org/jira/browse/SPARK-38220
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, MLlib
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38220) Upgrade `commons-math3` to 3.6.1

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492823#comment-17492823
 ] 

Apache Spark commented on SPARK-38220:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35535

> Upgrade `commons-math3` to 3.6.1
> 
>
> Key: SPARK-38220
> URL: https://issues.apache.org/jira/browse/SPARK-38220
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, MLlib
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38220) Upgrade `commons-math3` to 3.6.1

2022-02-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-38220:
-

 Summary: Upgrade `commons-math3` to 3.6.1
 Key: SPARK-38220
 URL: https://issues.apache.org/jira/browse/SPARK-38220
 Project: Spark
  Issue Type: Improvement
  Components: Build, MLlib
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-15 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492810#comment-17492810
 ] 

Steve Loughran commented on SPARK-38115:


* Stop using the classic FileOutputCommitter for your work, unless you like 
waiting a long time for your jobs to complete, along with a risk of corrupt 
data in the presence of worker failures.
* The choice of where temporary paths go is a function of the committer, not 
the Spark codebase; the s3a staging committer uses the local fs, for example.
* The magic committer does work under _temporary, but it doesn't write the 
final data there. It's "magic", after all. (See the configuration sketch below.)
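A minimal sketch of switching a job onto the S3A magic committer, based on the 
settings described in the Spark cloud-integration and S3A committer docs 
(assumptions: the spark-hadoop-cloud module is on the classpath and the 
destination is an S3A bucket):
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: bind Spark's commit protocol to the cloud committer classes and
// select the S3A "magic" committer instead of the classic FileOutputCommitter.
val spark = SparkSession.builder()
  .appName("s3a-magic-committer-sketch")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .getOrCreate()
{code}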

> No spark conf to control the path of _temporary when writing to target 
> filesystem
> -
>
> Key: SPARK-38115
> URL: https://issues.apache.org/jira/browse/SPARK-38115
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.1
>Reporter: kk
>Priority: Minor
>  Labels: spark, spark-conf, spark-sql, spark-submit
>
> No default spark conf or param to control the '_temporary' path when writing 
> to filesystem.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38115) No spark conf to control the path of _temporary when writing to target filesystem

2022-02-15 Thread kk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492785#comment-17492785
 ] 

kk commented on SPARK-38115:


Hello [~hyukjin.kwon], did you get a chance to look into this?

> No spark conf to control the path of _temporary when writing to target 
> filesystem
> -
>
> Key: SPARK-38115
> URL: https://issues.apache.org/jira/browse/SPARK-38115
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.2.1
>Reporter: kk
>Priority: Minor
>  Labels: spark, spark-conf, spark-sql, spark-submit
>
> No default spark conf or param to control the '_temporary' path when writing 
> to filesystem.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38130) array_sort does not allow non-orderable datatypes

2022-02-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38130.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35426
[https://github.com/apache/spark/pull/35426]

> array_sort does not allow non-orderable datatypes
> -
>
> Key: SPARK-38130
> URL: https://issues.apache.org/jira/browse/SPARK-38130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment:  
>Reporter: Steven Aerts
>Assignee: Steven Aerts
>Priority: Major
> Fix For: 3.3.0
>
>
> {{array_sort}} has a check to see if the entries it has to sort are orderable.
> I think this check should be removed, because even entries which are not 
> orderable can be made orderable by the provided lambda function.
> {code:java}
> Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> "b").selectExpr("array_sort(a, (x,y) -> cardinality(x) - 
> cardinality(y))"){code}
> fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'array_sort(`a`, 
> lambdafunction((cardinality(namedlambdavariable()) - 
> cardinality(namedlambdavariable())), namedlambdavariable(), 
> namedlambdavariable()))' due to data type mismatch: array_sort does not 
> support sorting array of type map which is not orderable {code}
> Meanwhile, the case where this check is actually relevant fails with a 
> different error, which is triggered earlier in the code path:
> {code:java}
> > Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> > "b").selectExpr("array_sort(a)"){code}
> Fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '(namedlambdavariable() < namedlambdavariable())' due to data type mismatch: 
> LessThan does not support ordering on type map; line 1 pos 0;
> {code}
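For comparison, the same comparator pattern already works today when the array 
elements are orderable, which shows that the lambda, not the element type, is 
what defines the ordering. A minimal sketch (assumes spark-shell, with implicits 
in scope):
{code:scala}
// Strings are orderable, so this passes the current check; the ticket argues that
// the comparator form should also be accepted for non-orderable element types.
Seq((Array("aa", "b", "ccc"), "x")).toDF("a", "b")
  .selectExpr("array_sort(a, (x, y) -> length(x) - length(y))")
  .show(false)
{code}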



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38130) array_sort does not allow non-orderable datatypes

2022-02-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38130:
---

Assignee: Steven Aerts

> array_sort does not allow non-orderable datatypes
> -
>
> Key: SPARK-38130
> URL: https://issues.apache.org/jira/browse/SPARK-38130
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment:  
>Reporter: Steven Aerts
>Assignee: Steven Aerts
>Priority: Major
>
> {{array_sort}} has a check to see if the entries it has to sort are orderable.
> I think this check should be removed, because even entries which are not 
> orderable can be made orderable by the provided lambda function.
> {code:java}
> Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> "b").selectExpr("array_sort(a, (x,y) -> cardinality(x) - 
> cardinality(y))"){code}
> fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'array_sort(`a`, 
> lambdafunction((cardinality(namedlambdavariable()) - 
> cardinality(namedlambdavariable())), namedlambdavariable(), 
> namedlambdavariable()))' due to data type mismatch: array_sort does not 
> support sorting array of type map which is not orderable {code}
> Meanwhile, the case where this check is actually relevant fails with a 
> different error, which is triggered earlier in the code path:
> {code:java}
> > Seq((Array[Map[String, Int]](Map("a" -> 1), Map()), "x")).toDF("a", 
> > "b").selectExpr("array_sort(a)"){code}
> Fails with:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> '(namedlambdavariable() < namedlambdavariable())' due to data type mismatch: 
> LessThan does not support ordering on type map; line 1 pos 0;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-15 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36808:
---
Fix Version/s: 3.2.2

> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a bunch of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-15 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36808:
---
Affects Version/s: 3.2.1

> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a bunch of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37425) Inline type hints for python/pyspark/mllib/recommendation.py

2022-02-15 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492616#comment-17492616
 ] 

Maciej Szymkiewicz commented on SPARK-37425:


Hi [~amirkdv]

Just FYI ‒ most of the blockers are already resolved. It should be possible to 
pick up the pending changes from SPARK-37428 and SPARK-37154 and complete this, 
or any of the remaining ones in mllib.

> Inline type hints for python/pyspark/mllib/recommendation.py
> 
>
> Key: SPARK-37425
> URL: https://issues.apache.org/jira/browse/SPARK-37425
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/mllib/recommendation.pyi to 
> python/pyspark/mllib/recommendation.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37413) Inline type hints for python/pyspark/ml/tree.py

2022-02-15 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37413:
--

Assignee: dch nguyen

> Inline type hints for python/pyspark/ml/tree.py
> ---
>
> Key: SPARK-37413
> URL: https://issues.apache.org/jira/browse/SPARK-37413
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: dch nguyen
>Priority: Major
>
> Inline type hints from python/pyspark/ml/tree.pyi to 
> python/pyspark/ml/tree.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37413) Inline type hints for python/pyspark/ml/tree.py

2022-02-15 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37413.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35420
[https://github.com/apache/spark/pull/35420]

> Inline type hints for python/pyspark/ml/tree.py
> ---
>
> Key: SPARK-37413
> URL: https://issues.apache.org/jira/browse/SPARK-37413
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/ml/tree.pyi to 
> python/pyspark/ml/tree.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37428) Inline type hints for python/pyspark/mllib/util.py

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37428:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/mllib/util.py
> --
>
> Key: SPARK-37428
> URL: https://issues.apache.org/jira/browse/SPARK-37428
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/mllib/util.pyi to 
> python/pyspark/mllib/util.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37428) Inline type hints for python/pyspark/mllib/util.py

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37428:


Assignee: Apache Spark

> Inline type hints for python/pyspark/mllib/util.py
> --
>
> Key: SPARK-37428
> URL: https://issues.apache.org/jira/browse/SPARK-37428
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/mllib/util.pyi to 
> python/pyspark/mllib/util.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37428) Inline type hints for python/pyspark/mllib/util.py

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492599#comment-17492599
 ] 

Apache Spark commented on SPARK-37428:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35532

> Inline type hints for python/pyspark/mllib/util.py
> --
>
> Key: SPARK-37428
> URL: https://issues.apache.org/jira/browse/SPARK-37428
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/mllib/util.pyi to 
> python/pyspark/mllib/util.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38199) Delete the unused `dataType` specified in the definition of `IntervalColumnAccessor`

2022-02-15 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38199:


Assignee: Yang Jie

> Delete the unused `dataType` specified in the definition of 
> `IntervalColumnAccessor`
> 
>
> Key: SPARK-38199
> URL: https://issues.apache.org/jira/browse/SPARK-38199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> SPARK-30066 introduced `IntervalColumnAccessor`, which accepts 2 constructor 
> parameters: `buffer` and `dataType`. The `dataType` is unused because the 
> parameter passed to `BasicColumnAccessor` is the constant `CALENDAR_INTERVAL`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38199) Delete the unused `dataType` specified in the definition of `IntervalColumnAccessor`

2022-02-15 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38199.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35507
[https://github.com/apache/spark/pull/35507]

> Delete the unused `dataType` specified in the definition of 
> `IntervalColumnAccessor`
> 
>
> Key: SPARK-38199
> URL: https://issues.apache.org/jira/browse/SPARK-38199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> SPARK-30066 introduced `IntervalColumnAccessor`, which accepts 2 constructor 
> parameters: `buffer` and `dataType`. The `dataType` is unused because the 
> parameter passed to `BasicColumnAccessor` is the constant `CALENDAR_INTERVAL`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38219:


Assignee: Apache Spark

> Support ANSI aggregation function percentile_cont as window function
> 
>
> Key: SPARK-38219
> URL: https://issues.apache.org/jira/browse/SPARK-38219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> percentile_cont is an aggregate function; some databases support it as a 
> window function.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38219:


Assignee: (was: Apache Spark)

> Support ANSI aggregation function percentile_cont as window function
> 
>
> Key: SPARK-38219
> URL: https://issues.apache.org/jira/browse/SPARK-38219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> percentile_cont is an aggregate function; some databases support it as a window 
> function.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492571#comment-17492571
 ] 

Apache Spark commented on SPARK-38219:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35531

> Support ANSI aggregation function percentile_cont as window function
> 
>
> Key: SPARK-38219
> URL: https://issues.apache.org/jira/browse/SPARK-38219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> percentile_cont is an aggregate function; some databases support it as a window 
> function.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492572#comment-17492572
 ] 

Apache Spark commented on SPARK-38219:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35531

> Support ANSI aggregation function percentile_cont as window function
> 
>
> Key: SPARK-38219
> URL: https://issues.apache.org/jira/browse/SPARK-38219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> percentile_cont is an aggregate function; some databases support it as a window 
> function.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function

2022-02-15 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-38219:
--

 Summary: Support ANSI aggregation function percentile_cont as 
window function
 Key: SPARK-38219
 URL: https://issues.apache.org/jira/browse/SPARK-38219
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng


percentile_cont is an aggregate function; some databases support it as a window 
function.
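
For reference, a hedged sketch of the kind of usage this would enable. The syntax is modeled on databases that already allow percentile_cont as a window function (e.g. Oracle and SQL Server); the table and column names are assumptions, and this is the target syntax rather than something current Spark accepts.

{code:scala}
// Assumed example only: percentile_cont applied as a window function.
// `employees`, `dept` and `salary` are hypothetical.
spark.sql("""
  SELECT dept,
         salary,
         percentile_cont(0.5) WITHIN GROUP (ORDER BY salary)
           OVER (PARTITION BY dept) AS dept_median_salary
  FROM employees
""").show()
{code}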



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37405) Inline type hints for python/pyspark/ml/feature.py

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492558#comment-17492558
 ] 

Apache Spark commented on SPARK-37405:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/35530

> Inline type hints for python/pyspark/ml/feature.py
> --
>
> Key: SPARK-37405
> URL: https://issues.apache.org/jira/browse/SPARK-37405
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/feature.pyi to 
> python/pyspark/ml/feature.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37405) Inline type hints for python/pyspark/ml/feature.py

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492559#comment-17492559
 ] 

Apache Spark commented on SPARK-37405:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/35530

> Inline type hints for python/pyspark/ml/feature.py
> --
>
> Key: SPARK-37405
> URL: https://issues.apache.org/jira/browse/SPARK-37405
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/feature.pyi to 
> python/pyspark/ml/feature.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37405) Inline type hints for python/pyspark/ml/feature.py

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37405:


Assignee: Apache Spark

> Inline type hints for python/pyspark/ml/feature.py
> --
>
> Key: SPARK-37405
> URL: https://issues.apache.org/jira/browse/SPARK-37405
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/ml/feature.pyi to 
> python/pyspark/ml/feature.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37405) Inline type hints for python/pyspark/ml/feature.py

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37405:


Assignee: Apache Spark

> Inline type hints for python/pyspark/ml/feature.py
> --
>
> Key: SPARK-37405
> URL: https://issues.apache.org/jira/browse/SPARK-37405
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/ml/feature.pyi to 
> python/pyspark/ml/feature.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37405) Inline type hints for python/pyspark/ml/feature.py

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37405:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/ml/feature.py
> --
>
> Key: SPARK-37405
> URL: https://issues.apache.org/jira/browse/SPARK-37405
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/feature.pyi to 
> python/pyspark/ml/feature.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2

2022-02-15 Thread Mehul Batra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mehul Batra updated SPARK-38218:

Attachment: Screenshot_20220214-013156.jpg

> Looks like the wrong package is available on the spark downloads page. The 
> name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
> -
>
> Key: SPARK-38218
> URL: https://issues.apache.org/jira/browse/SPARK-38218
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Mehul Batra
>Priority: Major
> Attachments: Screenshot_20220214-013156.jpg
>
>
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> Does the tgz actually contain Hadoop 3.3 with only the file name wrong, or is it 
> really the Hadoop 3.2 version?
> If it is Hadoop 3.3, does it come with S3 magic committer support?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2

2022-02-15 Thread Mehul Batra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mehul Batra updated SPARK-38218:

Description: 
!https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!

!https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!

Does the tgz actually contain Hadoop 3.3 with only the file name wrong, or is it 
really the Hadoop 3.2 version?
If it is Hadoop 3.3, does it come with S3 magic committer support?

  was:
!https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!

Does the tgz actually contain Hadoop 3.3 with only the file name wrong, or is it 
really the Hadoop 3.2 version?
If it is Hadoop 3.3, does it come with S3 magic committer support?


> Looks like the wrong package is available on the spark downloads page. The 
> name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2
> -
>
> Key: SPARK-38218
> URL: https://issues.apache.org/jira/browse/SPARK-38218
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Mehul Batra
>Priority: Major
>
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> !https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!
> Does the tgz actually contain Hadoop 3.3 with only the file name wrong, or is it 
> really the Hadoop 3.2 version?
> If it is Hadoop 3.3, does it come with S3 magic committer support?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38218) Looks like the wrong package is available on the spark downloads page. The name reads pre built for hadoop3.3 but the tgz file is marked as hadoop3.2

2022-02-15 Thread Mehul Batra (Jira)
Mehul Batra created SPARK-38218:
---

 Summary: Looks like the wrong package is available on the spark 
downloads page. The name reads pre built for hadoop3.3 but the tgz file is 
marked as hadoop3.2
 Key: SPARK-38218
 URL: https://issues.apache.org/jira/browse/SPARK-38218
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 3.2.1
Reporter: Mehul Batra


!https://files.slack.com/files-pri/T4S1WH2J3-F032FA551U7/screenshot_20220214-013156.jpg!

Does the tgz actually contain Hadoop 3.3 with only the file name wrong, or is it 
really the Hadoop 3.2 version?
If it is Hadoop 3.3, does it come with S3 magic committer support?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37913) Null Pointer Exception when Loading ML Pipeline Model with Custom Transformer

2022-02-15 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492517#comment-17492517
 ] 

zhengruifeng commented on SPARK-37913:
--

Does the `MyTransformer` in the example work?

> Null Pointer Exception when Loading ML Pipeline Model with Custom Transformer
> -
>
> Key: SPARK-37913
> URL: https://issues.apache.org/jira/browse/SPARK-37913
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
> Environment: Spark 3.1.2, Scala 2.12, Java 11
>Reporter: Alana Young
>Priority: Critical
>  Labels: MLPipelineModels, MLPipelines
>
> I am trying to create and persist a ML pipeline model using a custom Spark 
> transformer that I created based on the [Unary Transformer 
> example|https://github.com/apache/spark/blob/v3.1.2/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala]
>  provided by Spark. I am able to save and load the transformer. When I 
> include the custom transformer as a stage in a pipeline model, I can save the 
> model, but am unable to load it. Here is the stack trace of the exception:
>  
> {code:java}
> 01-14-2022 03:49:52 PM ERROR Instrumentation: java.lang.NullPointerException 
> at java.base/java.lang.reflect.Method.invoke(Method.java:559) at 
> org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstanceReader(ReadWrite.scala:631)
>  at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:276)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
>  at 
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213) at 
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268) at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160) at 
> org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155) at 
> org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
>  at 
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213) at 
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
>  at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355) at 
> org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355) at 
> org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337) at 
> com.dtech.scala.pipeline.PipelineProcess.process(PipelineProcess.scala:122) 
> at com.dtech.scala.pipeline.PipelineProcess$.main(PipelineProcess.scala:448) 
> at com.dtech.scala.pipeline.PipelineProcess.main(PipelineProcess.scala) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65) at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala){code}
>  
> *Source Code*
> [Unary 
> Transformer|https://gist.github.com/ally1221/ff10ec50f7ef98fb6cd365172195bfd5]
> [Persist Unary Transformer & Pipeline 
> Model|https://gist.github.com/ally1221/42473cdc818a8cf795ac78d65d48ee14]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38112) Use error classes in the execution errors of date/timestamp handling

2022-02-15 Thread huangtengfei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492514#comment-17492514
 ] 

huangtengfei commented on SPARK-38112:
--

I will work on this. Thanks [~maxgekk]

> Use error classes in the execution errors of date/timestamp handling
> 
>
> Key: SPARK-38112
> URL: https://issues.apache.org/jira/browse/SPARK-38112
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * sparkUpgradeInReadingDatesError
> * sparkUpgradeInWritingDatesError
> * timeZoneIdNotSpecifiedForTimestampTypeError
> * cannotConvertOrcTimestampToTimestampNTZError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38124) Revive HashClusteredDistribution and apply to stream-stream join

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492506#comment-17492506
 ] 

Apache Spark commented on SPARK-38124:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/35529

> Revive HashClusteredDistribution and apply to stream-stream join
> 
>
> Key: SPARK-38124
> URL: https://issues.apache.org/jira/browse/SPARK-38124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Blocker
> Fix For: 3.3.0
>
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with 
> ClusteredDistribution.
> While this works great for non-stateful operators, we still need a separate 
> distribution requirement for stateful operators, because the requirement of 
> ClusteredDistribution is too relaxed while the requirement on physical 
> partitioning for stateful operators is quite strict.
> In most cases, stateful operators must require the child distribution to be 
> HashClusteredDistribution, based on the major assumptions below:
>  # HashClusteredDistribution creates HashPartitioning and we will never ever 
> change it for the future.
>  # We will never ever change the implementation of {{partitionIdExpression}} 
> in HashPartitioning for the future, so that Partitioner will behave 
> consistently across Spark versions.
>  # No partitioning except HashPartitioning can satisfy 
> HashClusteredDistribution.
>  
> We should revive HashClusteredDistribution (probably renaming it specifically 
> for stateful operators) and apply that distribution to all stateful operators.
> SPARK-35703 only touched stream-stream join, which means stream-stream join 
> hasn't been broken in actual releases. Let's aim for a partial revert of 
> SPARK-35703 in this ticket, and have another ticket to deal with the other 
> stateful operators, which have been broken since their introduction (2.2+).
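
For context, a minimal stream-stream join sketch (the sources and column names are assumptions, for illustration only). This is the kind of stateful query whose state must stay co-partitioned by exactly the join keys, in a way that is stable across Spark versions, which is what the stricter distribution guarantees.

{code:scala}
// Assumed example: two rate sources standing in for real streams.
// State rows for a stream-stream join are keyed by the join keys, so both
// sides must be hash-partitioned on those keys in a version-stable way.
import org.apache.spark.sql.functions.expr

val impressions = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "adId")
val clicks = spark.readStream.format("rate").load()
  .withColumnRenamed("value", "clickAdId")

val joined = impressions.join(clicks, expr("adId = clickAdId"))
{code}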



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38124) Revive HashClusteredDistribution and apply to stream-stream join

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492505#comment-17492505
 ] 

Apache Spark commented on SPARK-38124:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/35529

> Revive HashClusteredDistribution and apply to stream-stream join
> 
>
> Key: SPARK-38124
> URL: https://issues.apache.org/jira/browse/SPARK-38124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Blocker
> Fix For: 3.3.0
>
>
> SPARK-35703 removed HashClusteredDistribution and replaced its usages with 
> ClusteredDistribution.
> While this works great for non-stateful operators, we still need a separate 
> distribution requirement for stateful operators, because the requirement of 
> ClusteredDistribution is too relaxed while the requirement on physical 
> partitioning for stateful operators is quite strict.
> In most cases, stateful operators must require the child distribution to be 
> HashClusteredDistribution, based on the major assumptions below:
>  # HashClusteredDistribution creates HashPartitioning and we will never ever 
> change it for the future.
>  # We will never ever change the implementation of {{partitionIdExpression}} 
> in HashPartitioning for the future, so that Partitioner will behave 
> consistently across Spark versions.
>  # No partitioning except HashPartitioning can satisfy 
> HashClusteredDistribution.
>  
> We should revive HashClusteredDistribution (probably renaming it specifically 
> for stateful operators) and apply that distribution to all stateful operators.
> SPARK-35703 only touched stream-stream join, which means stream-stream join 
> hasn't been broken in actual releases. Let's aim for a partial revert of 
> SPARK-35703 in this ticket, and have another ticket to deal with the other 
> stateful operators, which have been broken since their introduction (2.2+).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38217) insert overwrite failed for external table with dynamic partition table

2022-02-15 Thread YuanGuanhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YuanGuanhu updated SPARK-38217:
---
Affects Version/s: (was: 3.3.0)

> insert overwrite failed for external table with dynamic partition table
> ---
>
> Key: SPARK-38217
> URL: https://issues.apache.org/jira/browse/SPARK-38217
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: YuanGuanhu
>Priority: Major
>
> Cannot INSERT OVERWRITE a dynamic-partition table. Reproduction steps with 
> Spark 3.2.1 / Hadoop 3.2:
> sql("CREATE EXTERNAL TABLE exttb01(id int) PARTITIONED BY (p1 string, p2 
> string) STORED AS PARQUET LOCATION '/tmp/exttb01'")
> sql("set spark.sql.hive.convertMetastoreParquet=false")
> sql("set hive.exec.dynamic.partition.mode=nonstrict")
> val insertsql = "INSERT OVERWRITE TABLE exttb01 PARTITION(p1='n1', p2) SELECT 
> * FROM VALUES (1, 'n2'), (2, 'n3'), (3, 'n4') AS t(id, p2)"
> sql(insertsql)
> sql(insertsql)
> When the INSERT OVERWRITE is executed a second time, it fails:
>  
> WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n4 cannot be cleaned: 
> java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n4 does not 
> exist
> java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n4 does not 
> exist
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>         at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440)
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657)
>         at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929)
>         at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 22/02/15 17:59:19 WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n3 cannot 
> be cleaned: java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n3 
> does not exist
> java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n3 does not 
> exist
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>         at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440)
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657)
>         at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929)
>         at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 22/02/15 17:59:19 WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n2 cannot 
> be cleaned: java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n2 
> does not exist
> java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n2 does not 
> exist
>         at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>         at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440)
>         at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657)
>         at 

[jira] [Updated] (SPARK-38217) insert overwrite failed for external table with dynamic partition table

2022-02-15 Thread YuanGuanhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YuanGuanhu updated SPARK-38217:
---
Description: 
Cannot INSERT OVERWRITE a dynamic-partition table. Reproduction steps with 
Spark 3.2.1 / Hadoop 3.2:

sql("CREATE EXTERNAL TABLE exttb01(id int) PARTITIONED BY (p1 string, p2 
string) STORED AS PARQUET LOCATION '/tmp/exttb01'")
sql("set spark.sql.hive.convertMetastoreParquet=false")
sql("set hive.exec.dynamic.partition.mode=nonstrict")
val insertsql = "INSERT OVERWRITE TABLE exttb01 PARTITION(p1='n1', p2) SELECT * 
FROM VALUES (1, 'n2'), (2, 'n3'), (3, 'n4') AS t(id, p2)"
sql(insertsql)
sql(insertsql)

When the INSERT OVERWRITE is executed a second time, it fails:

 

WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n4 cannot be cleaned: 
java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n4 does not exist
java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n4 does not exist
        at 
org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
        at 
org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
        at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440)
        at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657)
        at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929)
        at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
22/02/15 17:59:19 WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n3 cannot be 
cleaned: java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n3 does 
not exist
java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n3 does not exist
        at 
org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
        at 
org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
        at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440)
        at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657)
        at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929)
        at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
22/02/15 17:59:19 WARN Hive: Directory file:/tmp/exttb01/p1=n1/p2=n2 cannot be 
cleaned: java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n2 does 
not exist
java.io.FileNotFoundException: File file:/tmp/exttb01/p1=n1/p2=n2 does not exist
        at 
org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:597)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
        at 
org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
        at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3440)
        at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1657)
        at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929)
        at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

  was:Cannot INSERT OVERWRITE a dynamic-partition table; reproduction steps:


> insert overwrite failed for external table with dynamic partition table
> 

[jira] [Created] (SPARK-38217) insert overwrite failed for external table with dynamic partition table

2022-02-15 Thread YuanGuanhu (Jira)
YuanGuanhu created SPARK-38217:
--

 Summary: insert overwrite failed for external table with dynamic 
partition table
 Key: SPARK-38217
 URL: https://issues.apache.org/jira/browse/SPARK-38217
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.3.0
Reporter: YuanGuanhu


Cannot INSERT OVERWRITE a dynamic-partition table. Reproduction steps:



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38215) InsertIntoHiveDir support convert metadata

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492490#comment-17492490
 ] 

Apache Spark commented on SPARK-38215:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/35528

> InsertIntoHiveDir support convert metadata
> --
>
> Key: SPARK-38215
> URL: https://issues.apache.org/jira/browse/SPARK-38215
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: angerszhu
>Priority: Major
>
> The current InsertIntoHiveDir command uses the Hive SerDe to write the data and 
> does not support converting to the native datasource writer, so such SQL cannot 
> write Parquet with zstd.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38215) InsertIntoHiveDir support convert metadata

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38215:


Assignee: Apache Spark

> InsertIntoHiveDir support convert metadata
> --
>
> Key: SPARK-38215
> URL: https://issues.apache.org/jira/browse/SPARK-38215
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> The current InsertIntoHiveDir command uses the Hive SerDe to write the data and 
> does not support converting to the native datasource writer, so such SQL cannot 
> write Parquet with zstd.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38215) InsertIntoHiveDir support convert metadata

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38215:


Assignee: (was: Apache Spark)

> InsertIntoHiveDir support convert metadata
> --
>
> Key: SPARK-38215
> URL: https://issues.apache.org/jira/browse/SPARK-38215
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: angerszhu
>Priority: Major
>
> The current InsertIntoHiveDir command uses the Hive SerDe to write the data and 
> does not support converting to the native datasource writer, so such SQL cannot 
> write Parquet with zstd.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492489#comment-17492489
 ] 

Apache Spark commented on SPARK-38216:
--

User 'Yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/35527

> When creating a Hive table, fail early if all the columns are partitioned 
> columns
> -
>
> Key: SPARK-38216
> URL: https://issues.apache.org/jira/browse/SPARK-38216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> In Hive, the schema and the partition columns must be disjoint sets. If all the 
> columns of a Hive table are partition columns, the remaining schema is empty and 
> table creation fails inside Hive with an error like:
> `throw new HiveException("at least one column must be specified for the table")`
> So when creating a Hive table, fail early if all the columns are partition 
> columns.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492488#comment-17492488
 ] 

Apache Spark commented on SPARK-38216:
--

User 'Yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/35527

> When creating a Hive table, fail early if all the columns are partitioned 
> columns
> -
>
> Key: SPARK-38216
> URL: https://issues.apache.org/jira/browse/SPARK-38216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> In Hive, the schema and the partition columns must be disjoint sets. If all the 
> columns of a Hive table are partition columns, the remaining schema is empty and 
> table creation fails inside Hive with an error like:
> `throw new HiveException("at least one column must be specified for the table")`
> So when creating a Hive table, fail early if all the columns are partition 
> columns.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38216:


Assignee: (was: Apache Spark)

> When creating a Hive table, fail early if all the columns are partitioned 
> columns
> -
>
> Key: SPARK-38216
> URL: https://issues.apache.org/jira/browse/SPARK-38216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> In Hive, the schema and the partition columns must be disjoint sets. If all the 
> columns of a Hive table are partition columns, the remaining schema is empty and 
> table creation fails inside Hive with an error like:
> `throw new HiveException("at least one column must be specified for the table")`
> So when creating a Hive table, fail early if all the columns are partition 
> columns.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38216:


Assignee: Apache Spark

> When creating a Hive table, fail early if all the columns are partitioned 
> columns
> -
>
> Key: SPARK-38216
> URL: https://issues.apache.org/jira/browse/SPARK-38216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.3.0
>
>
> In Hive, the schema and the partition columns must be disjoint sets. If all the 
> columns of a Hive table are partition columns, the remaining schema is empty and 
> table creation fails inside Hive with an error like:
> `throw new HiveException("at least one column must be specified for the table")`
> So when creating a Hive table, fail early if all the columns are partition 
> columns.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns

2022-02-15 Thread yikf (Jira)
yikf created SPARK-38216:


 Summary: When creating a Hive table, fail early if all the columns 
are partitioned columns
 Key: SPARK-38216
 URL: https://issues.apache.org/jira/browse/SPARK-38216
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: yikf
 Fix For: 3.3.0


In Hive, the schema and the partition columns must be disjoint sets. If all the 
columns of a Hive table are partition columns, the remaining schema is empty and 
table creation fails inside Hive with an error such as:

`throw new HiveException("at least one column must be specified for the table")`

So when creating a Hive table, fail early if all the columns are partition columns.
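
A hypothetical reproduction (assumed syntax and table name, for illustration only) in which every declared column is also a partition column, leaving Hive with no data columns:

{code:scala}
// Assumed repro: all columns are partition columns, so the non-partition
// schema is empty and Hive itself rejects the table. The proposal is to
// fail fast on the Spark side before the request ever reaches Hive.
spark.sql("""
  CREATE TABLE all_part_cols (p1 STRING, p2 STRING)
  USING hive
  PARTITIONED BY (p1, p2)
""")
{code}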



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns

2022-02-15 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf updated SPARK-38216:
-
Description: 
In Hive, the schema and the partition columns must be disjoint sets. If all the 
columns of a Hive table are partition columns, the remaining schema is empty and 
table creation fails inside Hive with an error like:

`throw new HiveException("at least one column must be specified for the table")`

So when creating a Hive table, fail early if all the columns are partition columns.

  was:
In Hive, the schema and the partition columns must be disjoint sets. If all the 
columns of a Hive table are partition columns, the remaining schema is empty and 
table creation fails inside Hive with an error such as:

`throw new HiveException("at least one column must be specified for the table")`

So when creating a Hive table, fail early if all the columns are partition columns.


> When creating a Hive table, fail early if all the columns are partitioned 
> columns
> -
>
> Key: SPARK-38216
> URL: https://issues.apache.org/jira/browse/SPARK-38216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> In Hive, the schema and the partition columns must be disjoint sets. If all the 
> columns of a Hive table are partition columns, the remaining schema is empty and 
> table creation fails inside Hive with an error like:
> `throw new HiveException("at least one column must be specified for the table")`
> So when creating a Hive table, fail early if all the columns are partition 
> columns.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38215) InsertIntoHiveDir support convert metadata

2022-02-15 Thread angerszhu (Jira)
angerszhu created SPARK-38215:
-

 Summary: InsertIntoHiveDir support convert metadata
 Key: SPARK-38215
 URL: https://issues.apache.org/jira/browse/SPARK-38215
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.1
Reporter: angerszhu


The current InsertIntoHiveDir command uses the Hive SerDe to write the data and 
does not support converting to the native datasource writer, so such SQL cannot 
write Parquet with zstd.
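
A hedged example of the kind of statement affected (the output path and the inline data are assumptions): writing a directory through the Hive SerDe path, where a requested Parquet codec such as zstd is not applied because the write is not converted to Spark's native Parquet writer.

{code:scala}
// Assumed illustration: the compression setting below is honored by Spark's
// native Parquet datasource, but an INSERT OVERWRITE DIRECTORY that goes
// through the Hive SerDe writer does not pick it up.
spark.sql("SET spark.sql.parquet.compression.codec=zstd")
spark.sql("""
  INSERT OVERWRITE DIRECTORY '/tmp/hive_dir_out'
  STORED AS PARQUET
  SELECT * FROM VALUES (1, 'a'), (2, 'b') AS t(id, name)
""")
{code}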



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38214) No need to filter data when the sliding window length is not redundant

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38214:


Assignee: (was: Apache Spark)

> No need to filter data when the sliding window length is not redundant
> --
>
> Key: SPARK-38214
> URL: https://issues.apache.org/jira/browse/SPARK-38214
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: nyingping
>Priority: Minor
>
> At present, the sliding window is implemented as expand + filter, but in 
> some cases the filter is not necessary.
> The filter is only required when the sliding window is irregular. When the 
> window length is evenly divisible by the slide length (which I believe is the 
> case for most sliding-window workloads in practice), there is no need to 
> filter, which saves computation and improves performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38214) No need to filter data when the sliding window length is not redundant

2022-02-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492460#comment-17492460
 ] 

Apache Spark commented on SPARK-38214:
--

User 'nyingping' has created a pull request for this issue:
https://github.com/apache/spark/pull/35526

> No need to filter data when the sliding window length is not redundant
> --
>
> Key: SPARK-38214
> URL: https://issues.apache.org/jira/browse/SPARK-38214
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: nyingping
>Priority: Minor
>
> At present, the sliding window is implemented as expand + filter, but in 
> some cases the filter is not necessary.
> The filter is only required when the sliding window is irregular. When the 
> window length is evenly divisible by the slide length (which I believe is the 
> case for most sliding-window workloads in practice), there is no need to 
> filter, which saves computation and improves performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38214) No need to filter data when the sliding window length is not redundant

2022-02-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38214:


Assignee: Apache Spark

> No need to filter data when the sliding window length is not redundant
> --
>
> Key: SPARK-38214
> URL: https://issues.apache.org/jira/browse/SPARK-38214
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: nyingping
>Assignee: Apache Spark
>Priority: Minor
>
> At present, the sliding window is implemented as expand + filter, but in 
> some cases the filter is not necessary.
> The filter is only required when the sliding window is irregular. When the 
> window length is evenly divisible by the slide length (which I believe is the 
> case for most sliding-window workloads in practice), there is no need to 
> filter, which saves computation and improves performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-15 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492452#comment-17492452
 ] 

Kousuke Saruta commented on SPARK-36808:


Ah, O.K. I misunderstood. I'll withdraw the PRs.




> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a bunch of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38214) No need to filter data when the sliding window length is not redundant

2022-02-15 Thread nyingping (Jira)
nyingping created SPARK-38214:
-

 Summary: No need to filter data when the sliding window length is 
not redundant
 Key: SPARK-38214
 URL: https://issues.apache.org/jira/browse/SPARK-38214
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.2.1
Reporter: nyingping


At present, the sliding window is implemented as expand + filter, but in some 
cases the filter is not necessary.

The filter is only required when the sliding window is irregular. When the window 
length is evenly divisible by the slide length (which I believe is the case for 
most sliding-window workloads in practice), there is no need to filter, which 
saves computation and improves performance.
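
As an illustration (the stream, column names, and durations are assumptions): a 10-minute window sliding every 5 minutes divides evenly, so every row expanded into candidate windows lands in a valid window and, under this proposal, the post-expand filter could be skipped.

{code:scala}
// Assumed example: windowDuration (10 min) is a multiple of slideDuration (5 min),
// the common case where the extra filter after Expand adds no value.
import org.apache.spark.sql.functions.{col, window}

val events = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "eventTime")
  .withColumn("key", col("value") % 10)

val counts = events
  .groupBy(window(col("eventTime"), "10 minutes", "5 minutes"), col("key"))
  .count()
{code}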




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492440#comment-17492440
 ] 

Dongjoon Hyun commented on SPARK-36808:
---

I approved the first one and added comments for the other two PRs.

> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a bunch of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36808) Upgrade Kafka to 2.8.1

2022-02-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492437#comment-17492437
 ] 

Dongjoon Hyun commented on SPARK-36808:
---

Please consider `branch-3.2` only.

> Upgrade Kafka to 2.8.1
> --
>
> Key: SPARK-36808
> URL: https://issues.apache.org/jira/browse/SPARK-36808
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> A few hours ago, Kafka 2.8.1 was released, which includes a bunch of bug fixes.
> https://downloads.apache.org/kafka/2.8.1/RELEASE_NOTES.html



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


