[GitHub] [spark] zhengruifeng commented on pull request #40322: [SPARK-41775][PYTHON][FOLLOW-UP] Updating error message for training using PyTorch functions

2023-03-07 Thread via GitHub
zhengruifeng commented on PR #40322: URL: https://github.com/apache/spark/pull/40322#issuecomment-1459625070 merged into master/branch-3.4 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [spark] zhengruifeng closed pull request #40322: [SPARK-41775][PYTHON][FOLLOW-UP] Updating error message for training using PyTorch functions

2023-03-07 Thread via GitHub
zhengruifeng closed pull request #40322: [SPARK-41775][PYTHON][FOLLOW-UP] Updating error message for training using PyTorch functions URL: https://github.com/apache/spark/pull/40322

[GitHub] [spark] mridulm commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub
mridulm commented on PR #40307: URL: https://github.com/apache/spark/pull/40307#issuecomment-1459618430 @jerqi the basic issue here is, `getPreferredLocations` in `ShuffledRowRDD` should return `Nil` at the very beginning in case `spark.shuffle.reduceLocality.enabled = false` We

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub
WeichenXu123 commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1127834899 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/Serializer.scala: ## @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

2023-03-07 Thread via GitHub
MaxGekk commented on code in PR #40126: URL: https://github.com/apache/spark/pull/40126#discussion_r1129047990 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -465,7 +465,20 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] LuciferYang commented on pull request #40317: [SPARK-42700][BUILD] Add `h2` as test dependency of connect-server module

2023-03-07 Thread via GitHub
LuciferYang commented on PR #40317: URL: https://github.com/apache/spark/pull/40317#issuecomment-1459646333 Thanks @HyukjinKwon @hvanhovell @dongjoon-hyun @beliefer

[GitHub] [spark] xinrong-meng opened a new pull request, #40330: [SPARK-42712][PYTHON][DOC] Improve docstring of mapInPandas and mapInArrow

2023-03-07 Thread via GitHub
xinrong-meng opened a new pull request, #40330: URL: https://github.com/apache/spark/pull/40330 ### What changes were proposed in this pull request? Improve docstring of mapInPandas and mapInArrow ### Why are the changes needed? For readability. We call out they are not scalar
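The distinction the improved docstring calls out — that `mapInPandas`/`mapInArrow` functions consume and produce iterators of batches rather than acting on scalar rows — can be sketched with plain-Python stand-ins (the list-of-dicts `Batch` type here is illustrative; in real `mapInPandas` each batch is a `pandas.DataFrame`):

```python
from typing import Dict, Iterator, List

# Toy stand-in for a batch; real mapInPandas passes pandas DataFrames.
Batch = List[Dict[str, int]]

def filter_func(batches: Iterator[Batch]) -> Iterator[Batch]:
    """A mapInPandas-style function: it receives an ITERATOR of batches,
    not one row at a time, and yields an iterator of output batches."""
    for batch in batches:
        yield [row for row in batch if row["id"] > 1]

# Simulate one partition streaming two batches through the function.
out = list(filter_func(iter([[{"id": 1}, {"id": 2}], [{"id": 3}]])))
print(out)  # [[{'id': 2}], [{'id': 3}]]
```

Because the function sees whole batches, it may also return a different number of rows than it received, which is exactly why it is not a scalar transformation.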

[GitHub] [spark] xinrong-meng commented on pull request #40329: [SPARK-42710][CONNECT][PYTHON] Rename FrameMap proto to MapPartitions

2023-03-07 Thread via GitHub
xinrong-meng commented on PR #40329: URL: https://github.com/apache/spark/pull/40329#issuecomment-1459636445 CC @HyukjinKwon @hvanhovell

[GitHub] [spark] yaooqinn commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-07 Thread via GitHub
yaooqinn commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1459654900 You first defined a case-sensitive data set, then queried it in a case-insensitive way; I guess the error is expected.

[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-07 Thread via GitHub
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1459668923 > You first defined a case-sensitive data set, then queried in a case-insensitive way, I guess the error is expected. In the physical plan, both id and ID columns are projected to

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub
zhengruifeng commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1129079073 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala: ## @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] zhengruifeng opened a new pull request, #40331: [SPARK-42713][PYTHON][DOCS] Add '__getattr__' and '__getitem__' of DataFrame and Column to API reference

2023-03-07 Thread via GitHub
zhengruifeng opened a new pull request, #40331: URL: https://github.com/apache/spark/pull/40331 ### What changes were proposed in this pull request? Add '__getattr__' and '__getitem__' of DataFrame and Column to API reference ### Why are the changes needed? '__getattr__'
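For context, `__getattr__` and `__getitem__` merit API-reference entries because both power column access on a DataFrame (`df.col` and `df["col"]`). A toy class — not PySpark's actual implementation — sketches the mechanism:

```python
class ToyFrame:
    """Minimal sketch of why both df.col and df["col"] resolve to columns."""

    def __init__(self, columns):
        self._columns = set(columns)

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails, so real
        # methods and attributes always shadow column names.
        if name in self._columns:
            return f"Column<{name}>"
        raise AttributeError(name)

    def __getitem__(self, name):
        if name in self._columns:
            return f"Column<{name}>"
        raise KeyError(name)

df = ToyFrame(["id", "name"])
print(df.id)       # Column<id>
print(df["name"])  # Column<name>
```

The bracket form is the safer of the two in practice, since dotted access cannot reach a column whose name collides with an existing attribute or method.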

[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-07 Thread via GitHub
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1459652942 > Can you try `set spark.sql.caseSensitive=true`? Yes, I have tried it. With caseSensitive set to true, it will work as then id and ID will be treated as separate columns.

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub
WeichenXu123 commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1129073675 ## connector/connect/common/src/main/protobuf/spark/connect/ml.proto: ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub
WeichenXu123 commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1129072923 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala: ## @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub
WeichenXu123 commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1129073121 ## connector/connect/common/src/main/protobuf/spark/connect/ml.proto: ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

[GitHub] [spark] jerqi commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub
jerqi commented on PR #40307: URL: https://github.com/apache/spark/pull/40307#issuecomment-1459630519 Could I raise another PR to fix this issue?

[GitHub] [spark] jerqi commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub
jerqi commented on PR #40307: URL: https://github.com/apache/spark/pull/40307#issuecomment-1459630906 > @jerqi the basic issue here is, `getPreferredLocations` in `ShuffledRowRDD` should return `Nil` at the very beginning in case `spark.shuffle.reduceLocality.enabled = false`

[GitHub] [spark] mridulm commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub
mridulm commented on PR #40307: URL: https://github.com/apache/spark/pull/40307#issuecomment-1459652457 Sure! Please go ahead :-)

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub
WeichenXu123 commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1129073881 ## connector/connect/common/src/main/protobuf/spark/connect/ml.proto: ## @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or

[GitHub] [spark] xinrong-meng commented on pull request #40330: [SPARK-42712][PYTHON][DOC] Improve docstring of mapInPandas and mapInArrow

2023-03-07 Thread via GitHub
xinrong-meng commented on PR #40330: URL: https://github.com/apache/spark/pull/40330#issuecomment-1459635902 CC @HyukjinKwon @hvanhovell

[GitHub] [spark] yaooqinn commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-07 Thread via GitHub
yaooqinn commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1459648316 Can you try `set spark.sql.caseSensitive=true`?
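For background on the suggestion above: with `spark.sql.caseSensitive=false` (the default), column resolution compares names case-insensitively, so a reference like `id` can match both an `id` and an `ID` column and be reported as ambiguous. A toy resolver — illustrative only, not Spark's analyzer — shows the mechanics:

```python
def resolve(ref, columns, case_sensitive=False):
    """Toy column resolution: case-sensitive mode matches names exactly,
    while case-insensitive mode can make distinct columns collide."""
    if case_sensitive:
        matches = [c for c in columns if c == ref]
    else:
        matches = [c for c in columns if c.lower() == ref.lower()]
    if len(matches) > 1:
        raise ValueError(f"Reference '{ref}' is ambiguous: {matches}")
    if not matches:
        raise ValueError(f"Column '{ref}' not found")
    return matches[0]

print(resolve("id", ["id", "ID"], case_sensitive=True))  # id
# resolve("id", ["id", "ID"]) with the default would raise:
# ValueError: Reference 'id' is ambiguous: ['id', 'ID']
```

This is why flipping `spark.sql.caseSensitive` changes whether the query above fails: the set of candidate matches shrinks to an exact match.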

[GitHub] [spark] peter-toth commented on a diff in pull request #40268: [SPARK-42500][SQL] ConstantPropagation support more cases

2023-03-07 Thread via GitHub
peter-toth commented on code in PR #40268: URL: https://github.com/apache/spark/pull/40268#discussion_r1127512831 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala: ## @@ -112,16 +113,13 @@ object ConstantFolding extends Rule[LogicalPlan]

[GitHub] [spark] yaooqinn opened a new pull request, #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-07 Thread via GitHub
yaooqinn opened a new pull request, #40313: URL: https://github.com/apache/spark/pull/40313 ### What changes were proposed in this pull request? Fix /api/v1/applications to return total uptime instead of 0 for duration ### Why are the changes needed? Fix

[GitHub] [spark] AngersZhuuuu opened a new pull request, #40315: [SPARK-42699][CONNECTOR] SparkConnectServer should make client and AM same exit code

2023-03-07 Thread via GitHub
AngersZhuuuu opened a new pull request, #40315: URL: https://github.com/apache/spark/pull/40315 ### What changes were proposed in this pull request? Since in https://github.com/apache/spark/pull/35594 we support passing an exit code to the AM, when SparkConnectServer exits with -1, we need to pass

[GitHub] [spark] AngersZhuuuu commented on pull request #40315: [SPARK-42699][CONNECTOR] SparkConnectServer should make client and AM same exit code

2023-03-07 Thread via GitHub
AngersZhuuuu commented on PR #40315: URL: https://github.com/apache/spark/pull/40315#issuecomment-1457984092 ping @HyukjinKwon

[GitHub] [spark] jerqi commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub
jerqi commented on PR #40307: URL: https://github.com/apache/spark/pull/40307#issuecomment-1457723960 > @jerqi locality may still have benefits when RSS works in hybrid deployments, besides, there is a dedicated configuration for that `spark.shuffle.reduceLocality.enabled`

[GitHub] [spark] AngersZhuuuu commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should pass exitCode to AM side

2023-03-07 Thread via GitHub
AngersZhuuuu commented on PR #40314: URL: https://github.com/apache/spark/pull/40314#issuecomment-1457967882 ping @cloud-fan @dongjoon-hyun @HyukjinKwon

[GitHub] [spark] zhengruifeng commented on pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-07 Thread via GitHub
zhengruifeng commented on PR #40097: URL: https://github.com/apache/spark/pull/40097#issuecomment-1457733081 @WeichenXu123 I think it is ready for review

[GitHub] [spark] alkis commented on pull request #40302: [SPARK-42686][CORE] Defer formatting for debug messages in TaskMemoryManager

2023-03-07 Thread via GitHub
alkis commented on PR #40302: URL: https://github.com/apache/spark/pull/40302#issuecomment-1457984961 > Mind retriggering https://github.com/alkis/spark/runs/11797022157? Done.

[GitHub] [spark] peter-toth commented on a diff in pull request #40268: [SPARK-42500][SQL] ConstantPropagation support more cases

2023-03-07 Thread via GitHub
peter-toth commented on code in PR #40268: URL: https://github.com/apache/spark/pull/40268#discussion_r1127511901 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala: ## @@ -138,56 +136,53 @@ object ConstantPropagation extends

[GitHub] [spark] FurcyPin commented on a diff in pull request #40271: [SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast

2023-03-07 Thread via GitHub
FurcyPin commented on code in PR #40271: URL: https://github.com/apache/spark/pull/40271#discussion_r1127557744 ## python/pyspark/sql/tests/test_functions.py: ## @@ -1268,6 +1268,12 @@ def test_bucket(self): message_parameters={"arg_name": "numBuckets", "arg_type":

[GitHub] [spark] AngersZhuuuu opened a new pull request, #40314: [SPARK-42698][CORE] SparkSubmit should pass exitCode to AM side

2023-03-07 Thread via GitHub
AngersZhuuuu opened a new pull request, #40314: URL: https://github.com/apache/spark/pull/40314 ### What changes were proposed in this pull request? Currently, when we run SparkSubmit in client mode and catch an exception during `runMain()`, it just calls `sc.stop()`, and the AM still exits

[GitHub] [spark] xingchaozh opened a new pull request, #40312: [SPARK-42695][SQL] Skew join handling in stream side of broadcast hash join

2023-03-07 Thread via GitHub
xingchaozh opened a new pull request, #40312: URL: https://github.com/apache/spark/pull/40312 ### What changes were proposed in this pull request? We could handle the stream side skew of BroadcastHashJoin to improve the join performance

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40270: [SPARK-42662][CONNECT][PYTHON][PS] Support `withSequenceColumn` as PySpark DataFrame internal function.

2023-03-07 Thread via GitHub
HyukjinKwon commented on code in PR #40270: URL: https://github.com/apache/spark/pull/40270#discussion_r1127644938 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -781,3 +782,10 @@ message FrameMap { CommonInlineUserDefinedFunction func =

[GitHub] [spark] HyukjinKwon commented on pull request #40311: [SPARK-42559][CONNECT][TESTS][FOLLOW-UP] Disable ANSI in several tests at DataFrameNaFunctionSuite.scala

2023-03-07 Thread via GitHub
HyukjinKwon commented on PR #40311: URL: https://github.com/apache/spark/pull/40311#issuecomment-1457730273 Merged to master and branch-3.4.

[GitHub] [spark] HyukjinKwon closed pull request #40311: [SPARK-42559][CONNECT][TESTS][FOLLOW-UP] Disable ANSI in several tests at DataFrameNaFunctionSuite.scala

2023-03-07 Thread via GitHub
HyukjinKwon closed pull request #40311: [SPARK-42559][CONNECT][TESTS][FOLLOW-UP] Disable ANSI in several tests at DataFrameNaFunctionSuite.scala URL: https://github.com/apache/spark/pull/40311

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-07 Thread via GitHub
zhengruifeng commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1127499837 ## mllib/common/pom.xml: ## @@ -0,0 +1,109 @@ +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40270: [SPARK-42662][CONNECT][PYTHON][PS] Support `withSequenceColumn` as PySpark DataFrame internal function.

2023-03-07 Thread via GitHub
HyukjinKwon commented on code in PR #40270: URL: https://github.com/apache/spark/pull/40270#discussion_r1127647073 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -781,3 +782,10 @@ message FrameMap { CommonInlineUserDefinedFunction func =

[GitHub] [spark] HyukjinKwon commented on pull request #40302: [SPARK-42686][CORE] Defer formatting for debug messages in TaskMemoryManager

2023-03-07 Thread via GitHub
HyukjinKwon commented on PR #40302: URL: https://github.com/apache/spark/pull/40302#issuecomment-1457930310 Mind retriggering https://github.com/alkis/spark/runs/11797022157?

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40215: [SPARK-42591][SS][DOCS] Add examples of unblocked workloads after SPARK-42376

2023-03-07 Thread via GitHub
HeartSaVioR commented on code in PR #40215: URL: https://github.com/apache/spark/pull/40215#discussion_r1127778137 ## docs/structured-streaming-programming-guide.md: ## @@ -1848,12 +1848,137 @@ Additional details on supported joins: - As of Spark 2.4, you can use joins only

[GitHub] [spark] HeartSaVioR commented on a diff in pull request #40215: [SPARK-42591][SS][DOCS] Add examples of unblocked workloads after SPARK-42376

2023-03-07 Thread via GitHub
HeartSaVioR commented on code in PR #40215: URL: https://github.com/apache/spark/pull/40215#discussion_r1127779011 ## docs/structured-streaming-programming-guide.md: ## @@ -1848,12 +1848,137 @@ Additional details on supported joins: - As of Spark 2.4, you can use joins only

[GitHub] [spark] itholic commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

2023-03-07 Thread via GitHub
itholic commented on PR #40280: URL: https://github.com/apache/spark/pull/40280#issuecomment-1458111230 Awesome!! Let me take a look at your PR. Thanks!

[GitHub] [spark] LuciferYang commented on a diff in pull request #40305: [SPARK-42656][CONNECT][Followup] Spark Connect Shell

2023-03-07 Thread via GitHub
LuciferYang commented on code in PR #40305: URL: https://github.com/apache/spark/pull/40305#discussion_r1127825622 ## repl/src/main/scala-2.12/org/apache/spark/repl/Main.scala: ## @@ -121,6 +121,11 @@ object Main extends Logging { sparkContext = sparkSession.sparkContext

[GitHub] [spark] LuciferYang commented on pull request #40317: [SPARK-42700][BUILD] Add `h2` as test dependency of connect-server module

2023-03-07 Thread via GitHub
LuciferYang commented on PR #40317: URL: https://github.com/apache/spark/pull/40317#issuecomment-1458156395 > this is the 101st time we have broken the maven build in the last month alone. We don't test with it, but we do feel comfortable to release with it. Are we sure the dual build setup is

[GitHub] [spark] shrprasa commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode

2023-03-07 Thread via GitHub
shrprasa commented on PR #37880: URL: https://github.com/apache/spark/pull/37880#issuecomment-1458162368 @holdenk Thanks for approving the PR. Can you please merge this PR or tag someone who can do it?

[GitHub] [spark] hvanhovell commented on a diff in pull request #40319: [SPARK-42692][CONNECT] Implement `Dataset.toJSON`

2023-03-07 Thread via GitHub
hvanhovell commented on code in PR #40319: URL: https://github.com/apache/spark/pull/40319#discussion_r1128015075 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2777,7 +2778,11 @@ class Dataset[T] private[sql] ( } def toJSON:

[GitHub] [spark] panbingkun opened a new pull request, #40316: [WIP][SPARK-42679][CONNECT] createDataFrame doesn't work with non-nullable schema

2023-03-07 Thread via GitHub
panbingkun opened a new pull request, #40316: URL: https://github.com/apache/spark/pull/40316 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No. ### How was this patch

[GitHub] [spark] LuciferYang commented on a diff in pull request #40318: [SPARK-42656][SPARK SHELL][CONNECT][FOLLOWUP] Add same `ClassNotFoundException` catch to `repl.Main` for Scala 2.13

2023-03-07 Thread via GitHub
LuciferYang commented on code in PR #40318: URL: https://github.com/apache/spark/pull/40318#discussion_r1127827370 ## repl/src/main/scala-2.13/org/apache/spark/repl/Main.scala: ## @@ -129,6 +129,11 @@ object Main extends Logging { sparkContext = sparkSession.sparkContext

[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-07 Thread via GitHub
shrprasa commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1458159825 > I'm not sure about the change, not sure I'm qualified to review it. I think at best the error message should change; I am not clear that the result is 'wrong' Thanks for

[GitHub] [spark] hvanhovell commented on pull request #40276: [SPARK-42630][CONNECT][PYTHON] Implement data type string parser

2023-03-07 Thread via GitHub
hvanhovell commented on PR #40276: URL: https://github.com/apache/spark/pull/40276#issuecomment-1458274759 At the end of the day it is an optimization. However I do think it is a sound one to have.

[GitHub] [spark] cloud-fan commented on a diff in pull request #40190: [SPARK-42597][SQL] Support unwrap date type to timestamp type

2023-03-07 Thread via GitHub
cloud-fan commented on code in PR #40190: URL: https://github.com/apache/spark/pull/40190#discussion_r1127955295 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala: ## @@ -350,7 +384,7 @@ object UnwrapCastInBinaryComparison

[GitHub] [spark] cloud-fan commented on a diff in pull request #40294: [SPARK-40610][SQL] Support unwrap date type to string type

2023-03-07 Thread via GitHub
cloud-fan commented on code in PR #40294: URL: https://github.com/apache/spark/pull/40294#discussion_r1127959943 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala: ## @@ -133,6 +133,11 @@ object

[GitHub] [spark] LuciferYang opened a new pull request, #40319: [SPARK-42692][CONNECT] Implement `Dataset.toJSON`

2023-03-07 Thread via GitHub
LuciferYang opened a new pull request, #40319: URL: https://github.com/apache/spark/pull/40319 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ###

[GitHub] [spark] hvanhovell closed pull request #40318: [SPARK-42656][SPARK SHELL][CONNECT][FOLLOWUP] Add same `ClassNotFoundException` catch to `repl.Main` for Scala 2.13

2023-03-07 Thread via GitHub
hvanhovell closed pull request #40318: [SPARK-42656][SPARK SHELL][CONNECT][FOLLOWUP] Add same `ClassNotFoundException` catch to `repl.Main` for Scala 2.13 URL: https://github.com/apache/spark/pull/40318

[GitHub] [spark] hvanhovell commented on a diff in pull request #40319: [SPARK-42692][CONNECT] Implement `Dataset.toJSON`

2023-03-07 Thread via GitHub
hvanhovell commented on code in PR #40319: URL: https://github.com/apache/spark/pull/40319#discussion_r1128052249 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2777,7 +2778,11 @@ class Dataset[T] private[sql] ( } def toJSON:

[GitHub] [spark] hvanhovell commented on a diff in pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter

2023-03-07 Thread via GitHub
hvanhovell commented on code in PR #40291: URL: https://github.com/apache/spark/pull/40291#discussion_r1128061371 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala: ## @@ -345,6 +345,48 @@ final class DataFrameWriter[T] private[sql] (ds:

[GitHub] [spark] panbingkun commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

2023-03-07 Thread via GitHub
panbingkun commented on PR #40280: URL: https://github.com/apache/spark/pull/40280#issuecomment-1458081353 > Thanks, @panbingkun ! By the way, I think this issue has a pretty high priority since the default nullability of a schema is `False`. > > ```python > >>> sdf =

[GitHub] [spark] waitinfuture commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub
waitinfuture commented on PR #40307: URL: https://github.com/apache/spark/pull/40307#issuecomment-1458107057 > This is still WIP, but want to get early feedback. +CC @Ngone51, @otterc, @waitinfuture Hi @mridulm , thanks for the work and it really simplifies the usage of Apache

[GitHub] [spark] srowen commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-07 Thread via GitHub
srowen commented on PR #40258: URL: https://github.com/apache/spark/pull/40258#issuecomment-1458138622 I'm not sure about the change, not sure I'm qualified to review it. I think at best the error message should change; I am not clear that the result is 'wrong'

[GitHub] [spark] LuciferYang opened a new pull request, #40318: [SPARK-42656][SPARK SHELL][CONNECT][FOLLOWUP] Add same `ClassNotFoundException` catch to `repl.Main` for Scala 2.13

2023-03-07 Thread via GitHub
LuciferYang opened a new pull request, #40318: URL: https://github.com/apache/spark/pull/40318 ### What changes were proposed in this pull request? This pr add the same `ClassNotFoundException` catch to `repl.Main` for Scala 2.13 as https://github.com/apache/spark/pull/40305 due

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub
WeichenXu123 commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1127830764 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -81,13 +82,50 @@ message Relation { // Catalog API (experimental /

[GitHub] [spark] cloud-fan commented on pull request #40276: [SPARK-42630][CONNECT][PYTHON] Implement data type string parser

2023-03-07 Thread via GitHub
cloud-fan commented on PR #40276: URL: https://github.com/apache/spark/pull/40276#issuecomment-1458245647 does it mean every spark connect client must implement a data type parser in its language? This seems a bit overkill. Can we revisit all the places that need to parse data type at

[GitHub] [spark] cloud-fan commented on pull request #38358: [SPARK-40588] FileFormatWriter materializes AQE plan before accessing outputOrdering

2023-03-07 Thread via GitHub
cloud-fan commented on PR #38358: URL: https://github.com/apache/spark/pull/38358#issuecomment-1458251751 @wangyum do you know why it's a problem only in 3.2?

[GitHub] [spark] LuciferYang commented on a diff in pull request #40319: [SPARK-42692][CONNECT] Implement `Dataset.toJSON`

2023-03-07 Thread via GitHub
LuciferYang commented on code in PR #40319: URL: https://github.com/apache/spark/pull/40319#discussion_r1128066536 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2777,7 +2778,11 @@ class Dataset[T] private[sql] ( } def toJSON:

[GitHub] [spark] beliefer commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names

2023-03-07 Thread via GitHub
beliefer commented on PR #40287: URL: https://github.com/apache/spark/pull/40287#issuecomment-1458109080 > @beliefer here is the thing. When this was designed it was mainly aimed at sql, and there we definitely do not generate unique names in lambda functions either. This is all done in

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub
WeichenXu123 commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1127841115 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/AlgorithmRegisty.scala: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub
WeichenXu123 commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1127841115 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/AlgorithmRegisty.scala: ## @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software

[GitHub] [spark] beliefer commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names

2023-03-07 Thread via GitHub
beliefer commented on PR #40287: URL: https://github.com/apache/spark/pull/40287#issuecomment-1458193681 > E... SQL/scala/Python all use the analyzer; they are all just frontends to the same thing. I found the reason. Although the Scala API uses the analyzer too, `object

[GitHub] [spark] justaparth opened a new pull request, #40320: Update code example formatting for protobuf parsing readme

2023-03-07 Thread via GitHub
justaparth opened a new pull request, #40320: URL: https://github.com/apache/spark/pull/40320 ### What changes were proposed in this pull request? I was reviewing this markdown document about proto parsing, and found that the formatting of code blocks looked incorrect: some

[GitHub] [spark] hvanhovell commented on pull request #40305: [SPARK-42656][CONNECT][Followup] Spark Connect Shell

2023-03-07 Thread via GitHub
hvanhovell commented on PR #40305: URL: https://github.com/apache/spark/pull/40305#issuecomment-1458096673 Merging.

[GitHub] [spark] hvanhovell closed pull request #40305: [SPARK-42656][CONNECT][Followup] Spark Connect Shell

2023-03-07 Thread via GitHub
hvanhovell closed pull request #40305: [SPARK-42656][CONNECT][Followup] Spark Connect Shell URL: https://github.com/apache/spark/pull/40305

[GitHub] [spark] zhengruifeng commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-07 Thread via GitHub
zhengruifeng commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1127788041 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala: ## @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [spark] hvanhovell commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names

2023-03-07 Thread via GitHub
hvanhovell commented on PR #40287: URL: https://github.com/apache/spark/pull/40287#issuecomment-1458116337 E... SQL/scala/Python all use the analyzer; they are all just frontends to the same thing.

[GitHub] [spark] LuciferYang opened a new pull request, #40317: [SPARK-42700][BUILD] Add `h2` as test dependency of connect-server module

2023-03-07 Thread via GitHub
LuciferYang opened a new pull request, #40317: URL: https://github.com/apache/spark/pull/40317 ### What changes were proposed in this pull request? Run the following commands
```
build/mvn clean install -DskipTests -pl connector/connect/server -am
build/mvn test -pl
```

[GitHub] [spark] hvanhovell commented on a diff in pull request #40315: [SPARK-42699][CONNECTOR] SparkConnectServer should make client and AM same exit code

2023-03-07 Thread via GitHub
hvanhovell commented on code in PR #40315: URL: https://github.com/apache/spark/pull/40315#discussion_r1127848561 ## sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala: ## @@ -736,13 +736,15 @@ class SparkSession private( } // scalastyle:on + def stop():

[GitHub] [spark] cloud-fan commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-07 Thread via GitHub
cloud-fan commented on code in PR #40308: URL: https://github.com/apache/spark/pull/40308#discussion_r1127909983 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala: ## @@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager:

[GitHub] [spark] panbingkun commented on pull request #40316: [WIP][SPARK-42679][CONNECT] createDataFrame doesn't work with non-nullable schema

2023-03-07 Thread via GitHub
panbingkun commented on PR #40316: URL: https://github.com/apache/spark/pull/40316#issuecomment-1458241313 (screenshot: https://user-images.githubusercontent.com/15246973/223446693-3c296b56-f9aa-4b70-9eb3-5bc9059ba631.png)

[GitHub] [spark] LuciferYang commented on a diff in pull request #40319: [SPARK-42692][CONNECT] Implement `Dataset.toJSON`

2023-03-07 Thread via GitHub
LuciferYang commented on code in PR #40319: URL: https://github.com/apache/spark/pull/40319#discussion_r1128042664 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2777,7 +2778,11 @@ class Dataset[T] private[sql] ( } def toJSON:
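For context on the `Dataset.toJSON` thread above: `toJSON` yields each row rendered as one compact JSON string. A rough Python sketch of the per-row semantics (the real implementation goes through Spark's JSON writer; the helper name here is illustrative):

```python
import json

def rows_to_json(rows):
    """Render each row (a dict of column name -> value) as one compact
    JSON string, roughly mirroring what Dataset.toJSON produces."""
    return [json.dumps(r, separators=(",", ":")) for r in rows]
```

In Spark itself the result is a `Dataset[String]` with one JSON document per row.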

[GitHub] [spark] mridulm commented on a diff in pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry

2023-03-07 Thread via GitHub
mridulm commented on code in PR #40286: URL: https://github.com/apache/spark/pull/40286#discussion_r1128353043 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4572,6 +4572,48 @@ class DAGSchedulerSuite extends SparkFunSuite with

[GitHub] [spark] hvanhovell commented on a diff in pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter

2023-03-07 Thread via GitHub
hvanhovell commented on code in PR #40291: URL: https://github.com/apache/spark/pull/40291#discussion_r1128152776 ## connector/connect/common/src/main/protobuf/spark/connect/commands.proto: ## @@ -116,6 +116,7 @@ message WriteOperation { TABLE_SAVE_METHOD_UNSPECIFIED =

[GitHub] [spark] mridulm commented on a diff in pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry

2023-03-07 Thread via GitHub
mridulm commented on code in PR #40286: URL: https://github.com/apache/spark/pull/40286#discussion_r1128351289 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4572,6 +4572,48 @@ class DAGSchedulerSuite extends SparkFunSuite with

[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-07 Thread via GitHub
hvanhovell commented on code in PR #40277: URL: https://github.com/apache/spark/pull/40277#discussion_r1128154890 ## connector/connect/common/src/main/protobuf/spark/connect/relations.proto: ## @@ -140,6 +140,11 @@ message Read { // (Optional) A list of path for

[GitHub] [spark] amaliujia commented on pull request #40319: [SPARK-42692][CONNECT] Implement `Dataset.toJSON`

2023-03-07 Thread via GitHub
amaliujia commented on PR #40319: URL: https://github.com/apache/spark/pull/40319#issuecomment-1458469794 LGTM

[GitHub] [spark] otterc commented on a diff in pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub
otterc commented on code in PR #40307: URL: https://github.com/apache/spark/pull/40307#discussion_r1128337921 ## core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala: ## @@ -203,7 +205,8 @@ private[spark] class ExecutorAllocationManager( throw new

[GitHub] [spark] mridulm commented on a diff in pull request #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-07 Thread via GitHub
mridulm commented on code in PR #40313: URL: https://github.com/apache/spark/pull/40313#discussion_r1128332164 ## core/src/main/scala/org/apache/spark/ui/SparkUI.scala: ## @@ -167,7 +167,7 @@ private[spark] class SparkUI private ( attemptId = None, startTime =

[GitHub] [spark] sunchao commented on a diff in pull request #40190: [SPARK-42597][SQL] Support unwrap date type to timestamp type

2023-03-07 Thread via GitHub
sunchao commented on code in PR #40190: URL: https://github.com/apache/spark/pull/40190#discussion_r1128241190 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparisonSuite.scala: ## @@ -368,6 +370,61 @@ class
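The optimization discussed in the entry above rewrites comparisons like `CAST(ts AS DATE) > DATE'd'` into a direct timestamp comparison, so the cast no longer blocks predicate pushdown. A small sketch of the equivalence the rule relies on (illustrative only; the actual rewrite lives in `UnwrapCastInBinaryComparison`):

```python
from datetime import date, datetime, timedelta

def date_gt(ts: datetime, d: date) -> bool:
    # Original predicate: CAST(ts AS DATE) > d
    return ts.date() > d

def unwrapped_gt(ts: datetime, d: date) -> bool:
    # Rewritten predicate: ts >= midnight of the day after d
    boundary = datetime.combine(d + timedelta(days=1), datetime.min.time())
    return ts >= boundary
```

Because the two predicates agree for every timestamp, the optimizer may substitute one for the other.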

[GitHub] [spark] ryan-johnson-databricks opened a new pull request, #40321: [SPARK-42704] SubqueryAlias propagates metadata columns that child outputs

2023-03-07 Thread via GitHub
ryan-johnson-databricks opened a new pull request, #40321: URL: https://github.com/apache/spark/pull/40321 ### What changes were proposed in this pull request? The `AddMetadataColumns` analyzer rule is designed to resolve metadata columns using
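A toy model of the behavior named in the PR title above: an alias node should keep exposing whatever metadata columns its child already outputs, rather than dropping them at the alias boundary. All class and field names here are hypothetical, not Catalyst's:

```python
class Node:
    def __init__(self, output, metadata_output=()):
        self.output = list(output)                    # regular columns
        self.metadata_output = list(metadata_output)  # e.g. _metadata columns

class SubqueryAlias(Node):
    def __init__(self, child):
        # Propagate only the metadata columns the child actually outputs,
        # so resolution through the alias can still find them.
        super().__init__(child.output, child.metadata_output)
```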

[GitHub] [spark] LuciferYang commented on pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub
LuciferYang commented on PR #40323: URL: https://github.com/apache/spark/pull/40323#issuecomment-1459158853 Is there a similar case on the Scala connect client?

[GitHub] [spark] jerqi commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-07 Thread via GitHub
jerqi commented on PR #40307: URL: https://github.com/apache/spark/pull/40307#issuecomment-1459167441 > @jerqi Agree that we should have a way to specify locality preference for disaggregated shuffle implementations to spark scheduler - so that shuffle tasks are closer to the data. >

[GitHub] [spark] yaooqinn commented on a diff in pull request #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-07 Thread via GitHub
yaooqinn commented on code in PR #40313: URL: https://github.com/apache/spark/pull/40313#discussion_r1128885696 ## core/src/main/scala/org/apache/spark/ui/SparkUI.scala: ## @@ -167,7 +167,7 @@ private[spark] class SparkUI private ( attemptId = None, startTime
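The fix under review above concerns `/api/v1/applications` reporting `0` as the duration of a still-running application. Conceptually, the live value should be derived from the start time; a minimal sketch (the function name is illustrative, not Spark's):

```python
import time

def live_duration_ms(start_time_ms, now_ms=None):
    """Duration of a still-running application: wall-clock time elapsed
    since startTime, rather than a hard-coded 0."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return max(0, now_ms - start_time_ms)
```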

[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #40315: [SPARK-42699][CONNECT] SparkConnectServer should make client and AM same exit code

2023-03-07 Thread via GitHub
AngersZhuuuu commented on code in PR #40315: URL: https://github.com/apache/spark/pull/40315#discussion_r112640 ## sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala: ## @@ -736,13 +736,15 @@ class SparkSession private( } // scalastyle:on + def stop():

[GitHub] [spark] panbingkun commented on pull request #40316: [SPARK-42679][CONNECT] createDataFrame doesn't work with non-nullable schema

2023-03-07 Thread via GitHub
panbingkun commented on PR #40316: URL: https://github.com/apache/spark/pull/40316#issuecomment-1459182974 cc @itholic

[GitHub] [spark] ueshin commented on pull request #40323: [SPARK-42705][CONNECT] Fix spark.sql to return values from the command

2023-03-07 Thread via GitHub
ueshin commented on PR #40323: URL: https://github.com/apache/spark/pull/40323#issuecomment-1459184767 > Is there a similar case on Scala connect client ? I haven't tried Scala client, but yes, it would happen, and this will fix both.

[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #40314: [SPARK-42698][CORE] SparkSubmit should pass exitCode to AM side

2023-03-07 Thread via GitHub
AngersZhuuuu commented on code in PR #40314: URL: https://github.com/apache/spark/pull/40314#discussion_r1128899260 ## core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala: ## @@ -1005,17 +1005,20 @@ private[spark] class SparkSubmit extends Logging { e }

[GitHub] [spark] AngersZhuuuu commented on a diff in pull request #40314: [SPARK-42698][CORE] SparkSubmit should pass exitCode to AM side

2023-03-07 Thread via GitHub
AngersZhuuuu commented on code in PR #40314: URL: https://github.com/apache/spark/pull/40314#discussion_r1128902772 ## core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala: ## @@ -1005,17 +1005,20 @@ private[spark] class SparkSubmit extends Logging { e }
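The SPARK-42698 thread above is about surfacing the client's exit code to the AM instead of swallowing it. The general pattern of propagating a child process's exit status, in illustrative Python rather than Spark's Scala code:

```python
import subprocess
import sys

def run_and_propagate(argv):
    """Run a child process and return its exit code, so the caller can
    exit with the same status instead of always reporting success."""
    result = subprocess.run(argv)
    return result.returncode
```

The caller would then pass the returned code to `sys.exit(...)` rather than exiting 0 unconditionally.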

[GitHub] [spark] allanf-db opened a new pull request, #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-07 Thread via GitHub
allanf-db opened a new pull request, #40324: URL: https://github.com/apache/spark/pull/40324 ### What changes were proposed in this pull request? Adding a Spark Connect overview page to the Spark 3.4 documentation. ### Why are the changes needed? The first

[GitHub] [spark] itholic commented on a diff in pull request #40316: [SPARK-42679][CONNECT] createDataFrame doesn't work with non-nullable schema

2023-03-07 Thread via GitHub
itholic commented on code in PR #40316: URL: https://github.com/apache/spark/pull/40316#discussion_r1128906598 ## python/pyspark/sql/tests/connect/test_connect_basic.py: ## @@ -2876,6 +2876,13 @@ def test_unsupported_io_functions(self): with
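The SPARK-42679 bug discussed above (`createDataFrame` not honoring a non-nullable schema) boils down to validating incoming rows against the schema's nullability flags. A simplified sketch of that check, with names that are illustrative rather than Spark Connect's internals:

```python
def validate_row(row, schema):
    """schema: list of (field_name, nullable) pairs; row: a dict.
    Raise if a non-nullable field holds None, mirroring the error a
    non-nullable schema is expected to produce."""
    for name, nullable in schema:
        if not nullable and row.get(name) is None:
            raise ValueError(f"field {name} is not nullable but got None")
```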
