[GitHub] [spark] itholic commented on a diff in pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list

2023-03-06 Thread via GitHub


itholic commented on code in PR #40282:
URL: https://github.com/apache/spark/pull/40282#discussion_r1127483352


##
python/docs/source/development/errors.rst:
##
@@ -0,0 +1,92 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===
+Error conditions in PySpark
+===
+
+This is a list of common, named error conditions returned by PySpark which are 
defined at `error_classes.py 
`_.
+
+When writing PySpark errors, developers must use an error condition from the 
list. If an appropriate error condition is not available, add a new one into 
the list. For more information, please refer to `Contributing Error and 
Exception 
`_.
+
+++--+

Review Comment:
   Or do you maybe want to organize all the error classes that exist in the JVM and Python on one page?
   
   IMHO, it is better to document them separately, because most error classes on the JVM side are SQL-related (including SQLSTATE), while the Python error classes cover Python-specific types and values.
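   For context, these named conditions are what the `error_class` argument of the PySpark exception types refers to. A minimal, illustrative sketch (the helper and its use of `NOT_STR` are assumptions for illustration, not code from this PR):
   ```python
   from pyspark.errors import PySparkTypeError
   
   def require_str(arg_name: str, value):
       # Raise a named error condition when the argument is not a string.
       if not isinstance(value, str):
           raise PySparkTypeError(
               error_class="NOT_STR",
               message_parameters={"arg_name": arg_name, "arg_type": type(value).__name__},
           )
       return value
   ```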






[GitHub] [spark] itholic commented on a diff in pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list

2023-03-06 Thread via GitHub


itholic commented on code in PR #40282:
URL: https://github.com/apache/spark/pull/40282#discussion_r1127479134


##
python/docs/source/development/errors.rst:
##
@@ -0,0 +1,92 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===
+Error conditions in PySpark
+===
+
+This is a list of common, named error conditions returned by PySpark which are 
defined at `error_classes.py 
`_.
+
+When writing PySpark errors, developers must use an error condition from the 
list. If an appropriate error condition is not available, add a new one into 
the list. For more information, please refer to `Contributing Error and 
Exception 
`_.
+
+++--+

Review Comment:
   Sure. I agree with that.
   Let me create another ticket to address error class documentation for the JVM side as well.






[GitHub] [spark] LuciferYang commented on pull request #40304: [SPARK-42665][CONNECT][Test] Mute Scala Client UDF test

2023-03-06 Thread via GitHub


LuciferYang commented on PR #40304:
URL: https://github.com/apache/spark/pull/40304#issuecomment-1457691351

   Thanks @HyukjinKwon :)





[GitHub] [spark] itholic commented on pull request #40288: [SPARK-42496][CONNECT][DOCS] Introduction Spark Connect at main page.

2023-03-06 Thread via GitHub


itholic commented on PR #40288:
URL: https://github.com/apache/spark/pull/40288#issuecomment-1457691357

   Also cc @HyukjinKwon 





[GitHub] [spark] HyukjinKwon closed pull request #40304: [SPARK-42665][CONNECT][Test] Mute Scala Client UDF test

2023-03-06 Thread via GitHub


HyukjinKwon closed pull request #40304: [SPARK-42665][CONNECT][Test] Mute Scala 
Client UDF test
URL: https://github.com/apache/spark/pull/40304





[GitHub] [spark] HyukjinKwon commented on pull request #40304: [SPARK-42665][CONNECT][Test] Mute Scala Client UDF test

2023-03-06 Thread via GitHub


HyukjinKwon commented on PR #40304:
URL: https://github.com/apache/spark/pull/40304#issuecomment-1457690785

   Merged to master and branch-3.4.





[GitHub] [spark] amaliujia commented on pull request #40311: [SPARK-42559][CONNECT][TESTS][FOLLOW-UP] Disable ANSI in several tests at DataFrameNaFunctionSuite.scala

2023-03-06 Thread via GitHub


amaliujia commented on PR #40311:
URL: https://github.com/apache/spark/pull/40311#issuecomment-1457682741

   LGTM





[GitHub] [spark] LuciferYang commented on pull request #40304: [SPARK-42665][CONNECT][Test] Mute Scala Client UDF test

2023-03-06 Thread via GitHub


LuciferYang commented on PR #40304:
URL: https://github.com/apache/spark/pull/40304#issuecomment-1457679758

   cc @HyukjinKwon, can we merge this first before the new RC? Otherwise, the Maven test will still fail.
   
   





[GitHub] [spark] itholic commented on a diff in pull request #40271: [SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast

2023-03-06 Thread via GitHub


itholic commented on code in PR #40271:
URL: https://github.com/apache/spark/pull/40271#discussion_r1127452614


##
python/pyspark/sql/tests/test_functions.py:
##
@@ -1268,6 +1268,12 @@ def test_bucket(self):
 message_parameters={"arg_name": "numBuckets", "arg_type": "str"},
 )
 
+def test_no_cast(self):

Review Comment:
   Oh, yeah, I thought we had `functions.cast`.
   
   Hmm... if `functions.cast` does not exist, I'm not sure whether removing the duplicated `typing.cast` export in advance is appropriate.
   
   WDYT, @HyukjinKwon?
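   A minimal sketch of the kind of guard such a test could make, assuming the goal is simply that `typing.cast` is not re-exported (the real test body is not shown in the diff above):
   ```python
   import typing
   
   from pyspark.sql import functions
   
   # typing.cast must not leak into pyspark.sql.functions' public namespace.
   assert getattr(functions, "cast", None) is not typing.cast
   ```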






[GitHub] [spark] olaky commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-06 Thread via GitHub


olaky commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1127449861


##
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala:
##
@@ -244,6 +245,89 @@ class FileMetadataStructSuite extends QueryTest with 
SharedSparkSession {
   parameters = Map("fieldName" -> "`file_name`", "fields" -> "`id`, 
`university`"))
   }
 
+  metadataColumnsTest("df metadataColumn - schema conflict",
+schemaWithNameConflicts) { (df, f0, f1) =>
+// the user data has the schema: name, age, _metadata.id, 
_metadata.university
+
+// get the real metadata column (whose name should have been adjusted)
+val metadataColumn = df.metadataColumn("_metadata")
+assert(metadataColumn.expr.asInstanceOf[NamedExpression].name == 
"__metadata")
+
+// select user data
+checkAnswer(
+  df.select("name", "age", "_METADATA", "_metadata")
+.withColumn("file_name", metadataColumn.getField("file_name")),
+  Seq(
+Row("jack", 24, Row(12345L, "uom"), Row(12345L, "uom"), 
f0(METADATA_FILE_NAME)),
+Row("lily", 31, Row(54321L, "ucb"), Row(54321L, "ucb"), 
f1(METADATA_FILE_NAME))
+  )
+)
+  }
+
+  metadataColumnsTest("df metadataColumn - no schema conflict",
+schema) { (df, f0, f1) =>
+// get the real metadata column (whose name should _NOT_ have been 
adjusted)
+val metadataColumn = df.metadataColumn("_metadata")
+assert(metadataColumn.expr.asInstanceOf[NamedExpression].name == 
"_metadata")
+
+// select user data
+checkAnswer(
+  df.select("name", "age")
+.withColumn("file_name", metadataColumn.getField("file_name")),
+  Seq(
+Row("jack", 24, f0(METADATA_FILE_NAME)),
+Row("lily", 31, f1(METADATA_FILE_NAME))
+  )
+)
+  }
+
+  metadataColumnsTest("df metadataColumn - column not found", schema) { (df, 
f0, f1) =>
+// Not a column at all
+checkError(
+  exception = intercept[AnalysisException] {
+df.withMetadataColumn("foo")
+  },
+  errorClass = "UNRESOLVED_COLUMN.WITH_SUGGESTION",
+  parameters = Map("objectName" -> "`foo`", "proposal" -> "`_metadata`"))
+
+// Name exists, but does not reference a metadata column
+checkError(
+  exception = intercept[AnalysisException] {
+df.withMetadataColumn("name")
+  },
+  errorClass = "UNRESOLVED_COLUMN.WITH_SUGGESTION",
+  parameters = Map("objectName" -> "`name`", "proposal" -> "`_metadata`"))
+  }
+
+  metadataColumnsTest("metadata name conflict resolved with leading 
underscores - one",
+schemaWithNameConflicts) { (df, f0, f1) =>
+// the user data has the schema: name, age, _metadata.id, 
_metadata.university
+
+checkAnswer(
+  df.select("name", "age", "_metadata", "__metadata.file_name"),
+  Seq(
+Row("jack", 24, Row(12345L, "uom"), f0(METADATA_FILE_NAME)),
+Row("lily", 31, Row(54321L, "ucb"), f1(METADATA_FILE_NAME))
+  )
+)
+  }
+
+  metadataColumnsTest("metadata name conflict resolved with leading 
underscores - several",
+new StructType()
+  .add(schema("name").copy(name = "_metadata"))
+  .add(schema("age").copy(name = "__metadata"))
+  .add(schema("info").copy(name = "___metadata"))) { (df, f0, f1) =>
+// the user data has the schema: _metadata, __metadata, ___metadata.id, 
___metadata.university
+
+checkAnswer(
+  df.select("_metadata", "__metadata", "___metadata", 
"metadata.file_name"),
+  Seq(
+Row("jack", 24, Row(12345L, "uom"), f0(METADATA_FILE_NAME)),
+Row("lily", 31, Row(54321L, "ucb"), f1(METADATA_FILE_NAME))
+  )
+)
+  }
+

Review Comment:
   Question about a use case with aliasing: if the metadata column is aliased, we still have to use the original name in `withMetadata` (or the other functions). So
   ```
   df.select("my_metadata", _metadata).withMetadata("another_metadata_name", "my_metadata")
   ```
   is not possible, I believe? I think this is OK; it just reflects that `withMetadata` is immune to renaming, but it would be good to document this clearly.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##
@@ -42,6 +42,24 @@ abstract class LogicalPlan
*/
   def metadataOutput: Seq[Attribute] = children.flatMap(_.metadataOutput)
 
+  /**
+   * Finds a metadata attribute of this node by its logical name. This search 
will work even if the
+   * metadata attribute was renamed because of a conflicting name in the data 
schema.
+   */
+  def getMetadataAttributeByName(name: String): Attribute = {
+// NOTE: An already-referenced column might appear in `output` instead of 
`metadataOutput`.

Review Comment:
   I had an instance in the debugger where this was not the case yesterday



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##
@@ -42,6 +42,24 @@ abstract class 

[GitHub] [spark] viirya commented on a diff in pull request #40215: [SPARK-42591][SS][DOCS] Add examples of unblocked workloads after SPARK-42376

2023-03-06 Thread via GitHub


viirya commented on code in PR #40215:
URL: https://github.com/apache/spark/pull/40215#discussion_r1127438527


##
docs/structured-streaming-programming-guide.md:
##
@@ -1848,12 +1848,137 @@ Additional details on supported joins:
 
 - As of Spark 2.4, you can use joins only when the query is in Append output 
mode. Other output modes are not yet supported.
 
-- As of Spark 2.4, you cannot use other non-map-like operations before joins. 
Here are a few examples of
-  what cannot be used.
+- You cannot use mapGroupsWithState and flatMapGroupsWithState before and 
after joins.
 
-  - Cannot use streaming aggregations before joins.
+In append output mode, you can construct a query having non-map-like 
operations before/after join.
 
-  - Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode 
before joins.
+For example, here's an example of time window aggregation in both streams 
followed by stream-stream join with event time window:
+
+
+
+
+{% highlight scala %}
+
+val clicksWindow = clicksWithWatermark
+  .groupBy(window("clickTime", "1 hour"))
+  .count()
+
+val impressionsWindow = impressionsWithWatermark
+  .groupBy(window("impressionTime", "1 hour"))
+  .count()
+
+clicksWindow.join(impressionsWindow, "window", "inner")
+
+{% endhighlight %}
+
+
+
+
+{% highlight java %}
+
+Dataset clicksWindow = clicksWithWatermark
+  .groupBy(functions.window(clicksWithWatermark.col("clickTime"), "1 hour"))
+  .count();
+
+Dataset impressionsWindow = impressionsWithWatermark
+  .groupBy(functions.window(impressionsWithWatermark.col("impressionTime"), "1 
hour"))
+  .count();
+
+clicksWindow.join(impressionsWindow, "window", "inner");
+
+{% endhighlight %}
+
+
+
+
+
+{% highlight python %}
+joined = impressionsWithWatermark.join(
+  clicksWithWatermark,
+  expr("""
+clickAdId = impressionAdId AND
+clickTime >= impressionTime AND
+clickTime <= impressionTime + interval 1 hour
+  """),
+  "leftOuter" # can be "inner", "leftOuter", "rightOuter", 
"fullOuter", "leftSemi"
+)
+
+joined.groupBy(
+  joined.clickAdId,
+  window(joined.clickTime, "1 hour")
+).count()

Review Comment:
   This Python example seems to duplicate the bottom one (join followed by time window aggregation).
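   For reference, a PySpark sketch of what the window-aggregation-then-join variant might look like, mirroring the Scala snippet above; `clicksWithWatermark` and `impressionsWithWatermark` are the streaming DataFrames the guide already assumes:
   ```python
   from pyspark.sql.functions import window
   
   clicksWindow = clicksWithWatermark.groupBy(
       window("clickTime", "1 hour")
   ).count()
   
   impressionsWindow = impressionsWithWatermark.groupBy(
       window("impressionTime", "1 hour")
   ).count()
   
   clicksWindow.join(impressionsWindow, "window", "inner")
   ```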






[GitHub] [spark] viirya commented on a diff in pull request #40215: [SPARK-42591][SS][DOCS] Add examples of unblocked workloads after SPARK-42376

2023-03-06 Thread via GitHub


viirya commented on code in PR #40215:
URL: https://github.com/apache/spark/pull/40215#discussion_r1127412999


##
docs/structured-streaming-programming-guide.md:
##
@@ -1848,12 +1848,137 @@ Additional details on supported joins:
 
 - As of Spark 2.4, you can use joins only when the query is in Append output 
mode. Other output modes are not yet supported.
 
-- As of Spark 2.4, you cannot use other non-map-like operations before joins. 
Here are a few examples of
-  what cannot be used.
+- You cannot use mapGroupsWithState and flatMapGroupsWithState before and 
after joins.
 
-  - Cannot use streaming aggregations before joins.
+In append output mode, you can construct a query having non-map-like 
operations before/after join.

Review Comment:
   `non-map-like operations` doesn't sound clear to me. Can you add some examples, like:
   
   ```
   In append output mode, you can construct a query having non-map-like 
operations, e.g. ..., ..., ..., before/after join.
   ```






[GitHub] [spark] HyukjinKwon opened a new pull request, #40311: [SPARK-42559][CONNECT][TESTS][FOLLOW-UP] Disable ANSI in several tests at DataFrameNaFunctionSuite.scala

2023-03-06 Thread via GitHub


HyukjinKwon opened a new pull request, #40311:
URL: https://github.com/apache/spark/pull/40311

   ### What changes were proposed in this pull request?
   
   This PR proposes to disable ANSI mode in both `replace float with nan` and 
`replace double with nan` tests.
   
   ### Why are the changes needed?
   
   To recover the build https://github.com/apache/spark/actions/runs/4349682658 
with ANSI mode on.
   The Spark Connect side does not fully leverage the error framework yet, so we simply disable it for now.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   Manually ran them in IDE with ANSI mode on.
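   
   The change itself lives in the Scala suite, but conceptually it just toggles the `spark.sql.ansi.enabled` flag around the affected assertions. A rough, hypothetical sketch of that toggle (assuming a running SparkSession `spark`):
   ```python
   # Run the `replace ... with nan` checks with ANSI mode off, then restore it.
   original = spark.conf.get("spark.sql.ansi.enabled", None)
   spark.conf.set("spark.sql.ansi.enabled", "false")
   try:
       ...  # run the affected assertions here
   finally:
       if original is None:
           spark.conf.unset("spark.sql.ansi.enabled")
       else:
           spark.conf.set("spark.sql.ansi.enabled", original)
   ```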





[GitHub] [spark] jerqi commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


jerqi commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1457625037

   > spark.shuffle.reduceLocality.enabled
   
   Thanks, I got it.





[GitHub] [spark] pan3793 commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


pan3793 commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1457619866

   @jerqi locality may still have benefits when RSS works in hybrid deployments; besides, there is a dedicated configuration for that: `spark.shuffle.reduceLocality.enabled`.
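   For reference, a minimal sketch of setting that configuration when shuffle data is served remotely (standard SparkSession builder usage, just for illustration):
   ```python
   from pyspark.sql import SparkSession
   
   # Disable reduce-task locality preferences, e.g. when shuffle data lives in a
   # remote/disaggregated shuffle service rather than on the executors.
   spark = (
       SparkSession.builder
       .config("spark.shuffle.reduceLocality.enabled", "false")
       .getOrCreate()
   )
   ```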





[GitHub] [spark] jerqi commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


jerqi commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1457606879

   Hi @mridulm, thanks for your great work! Apache Uniffle is a project similar to Apache Celeborn. We also patch Apache Spark, e.g. 
https://github.com/apache/incubator-uniffle/blob/master/spark-patches/spark-3.2.1_dynamic_allocation_support.patch.
   Considering that the shuffle data is stored in a distributed filesystem or in a disaggregated shuffle cluster, maybe we should also modify the method `ShuffledRowRDD#getPreferredLocations`.





[GitHub] [spark] shrprasa commented on pull request #37880: [SPARK-39399] [CORE] [K8S]: Fix proxy-user authentication for Spark on k8s in cluster deploy mode

2023-03-06 Thread via GitHub


shrprasa commented on PR #37880:
URL: https://github.com/apache/spark/pull/37880#issuecomment-1457588129

   Gentle ping @holdenk @dongjoon-hyun @Ngone51 , @HyukjinKwon





[GitHub] [spark] shrprasa commented on pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates

2023-03-06 Thread via GitHub


shrprasa commented on PR #40128:
URL: https://github.com/apache/spark/pull/40128#issuecomment-1457586866

   gentle ping @holdenk 





[GitHub] [spark] HeartSaVioR commented on pull request #40215: [SPARK-42591][SS][DOCS] Add examples of unblocked workloads after SPARK-42376

2023-03-06 Thread via GitHub


HeartSaVioR commented on PR #40215:
URL: https://github.com/apache/spark/pull/40215#issuecomment-1457585553

   cc. @viirya as well, who may be interested in the new feature in SS.





[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL]:Incorrect ambiguous column reference error

2023-03-06 Thread via GitHub


shrprasa commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1457585690

   Gentle Ping @srowen  @dongjoon-hyun 





[GitHub] [spark] HeartSaVioR commented on pull request #40215: [SPARK-42591][SS][DOCS] Add examples of unblocked workloads after SPARK-42376

2023-03-06 Thread via GitHub


HeartSaVioR commented on PR #40215:
URL: https://github.com/apache/spark/pull/40215#issuecomment-1457584928

   cc. @zsxwing @rangadi @jerrypeng @anishshri-db @chaoqin-li1123 
   cc-ing folks who reviewed the code change PR. This PR is a doc change to show what is being unblocked, like we did in 
https://github.com/apache/spark/pull/40188 when fixing broken late record filtering.





[GitHub] [spark] amaliujia commented on a diff in pull request #40310: [SPARK-42022][CONNECT][PYTHON] Fix createDataFrame to autogenerate missing column names

2023-03-06 Thread via GitHub


amaliujia commented on code in PR #40310:
URL: https://github.com/apache/spark/pull/40310#discussion_r1127370577


##
python/pyspark/sql/connect/session.py:
##
@@ -235,6 +235,9 @@ def createDataFrame(
 # If no schema supplied by user then get the names of columns only
 if schema is None:
 _cols = [str(x) if not isinstance(x, str) else x for x in 
data.columns]
+elif isinstance(schema, (list, tuple)) and _num_cols < 
len(data.columns):
+_cols = _cols + [f"_{i + 1}" for i in range(_num_cols, 
len(data.columns))]

Review Comment:
   In fact, I guess we can probably do a bit more: we need to make sure the user-provided column names are not the same as the auto-generated ones.
   
   The probability of a collision is small, though, so maybe this is not a big concern.
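   A small sketch of the resulting naming behaviour, assuming a three-column pandas DataFrame and a single user-supplied name (the `_2`, `_3` names follow the `_{i + 1}` pattern in the diff):
   ```python
   import pandas as pd
   
   pdf = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
   cols = ["x"]  # the user supplied only one column name
   
   # Fill in the missing names the same way the diff above does.
   num_cols = len(cols)
   cols = cols + [f"_{i + 1}" for i in range(num_cols, len(pdf.columns))]
   print(cols)  # ['x', '_2', '_3']
   ```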






[GitHub] [spark] amaliujia commented on a diff in pull request #40310: [SPARK-42022][CONNECT][PYTHON] Fix createDataFrame to autogenerate missing column names

2023-03-06 Thread via GitHub


amaliujia commented on code in PR #40310:
URL: https://github.com/apache/spark/pull/40310#discussion_r1127370577


##
python/pyspark/sql/connect/session.py:
##
@@ -235,6 +235,9 @@ def createDataFrame(
 # If no schema supplied by user then get the names of columns only
 if schema is None:
 _cols = [str(x) if not isinstance(x, str) else x for x in 
data.columns]
+elif isinstance(schema, (list, tuple)) and _num_cols < 
len(data.columns):
+_cols = _cols + [f"_{i + 1}" for i in range(_num_cols, 
len(data.columns))]

Review Comment:
   In fact, I guess probably we can do a bit more: need to make sure the user 
provided column name is not the same as the auto-generated one.
   
   Though the probability of the collision is small so maybe this is not a big 
concern.






[GitHub] [spark] amaliujia commented on a diff in pull request #40310: [SPARK-42022][CONNECT][PYTHON] Fix createDataFrame to autogenerate missing column names

2023-03-06 Thread via GitHub


amaliujia commented on code in PR #40310:
URL: https://github.com/apache/spark/pull/40310#discussion_r1127370577


##
python/pyspark/sql/connect/session.py:
##
@@ -235,6 +235,9 @@ def createDataFrame(
 # If no schema supplied by user then get the names of columns only
 if schema is None:
 _cols = [str(x) if not isinstance(x, str) else x for x in 
data.columns]
+elif isinstance(schema, (list, tuple)) and _num_cols < 
len(data.columns):
+_cols = _cols + [f"_{i + 1}" for i in range(_num_cols, 
len(data.columns))]

Review Comment:
   In fact, I guess probably we can do a bit more: need to make sure the user 
provided column name is not the same as the auto-generated one.
   
   Though the probability of the collision is small.






[GitHub] [spark] amaliujia commented on pull request #40310: [SPARK-42022][CONNECT][PYTHON] Fix createDataFrame to autogenerate missing column names

2023-03-06 Thread via GitHub


amaliujia commented on PR #40310:
URL: https://github.com/apache/spark/pull/40310#issuecomment-1457579306

   LGTM!





[GitHub] [spark] itholic commented on pull request #40288: [WIP][SPARK-42496][CONNECT][DOCS] Introduction Spark Connect at main page.

2023-03-06 Thread via GitHub


itholic commented on PR #40288:
URL: https://github.com/apache/spark/pull/40288#issuecomment-1457564181

   cc @allanf-db, I addressed the comments we discussed offline.





[GitHub] [spark] itholic commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

2023-03-06 Thread via GitHub


itholic commented on PR #40280:
URL: https://github.com/apache/spark/pull/40280#issuecomment-1457558427

   Thanks, @panbingkun !
   By the way, I think this issue has a pretty high priority since the default 
nullability of a schema is `False`.
   
   ```python
   >>> sdf = spark.range(10).schema
   self._schema: StructType([StructField('id', LongType(), False)])
   ```
   
   For example, even intuitive and simple code like creating a DataFrame from a 
pandas DataFrame fails as follows:
   ```python
   >>> sdf = spark.range(10)
   >>> pdf = sdf.toPandas()
   >>> spark.createDataFrame(pdf, sdf.schema)
   Traceback (most recent call last):
   ...
   pyspark.errors.exceptions.connect.AnalysisException: 
[NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's required 
to be non-nullable.
   ```
   
   Feel free to ping me anytime if you need any help!
   Thanks again for your time investigating this :-)
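   A possible workaround sketch until the fix lands, assuming it is acceptable to relax nullability on the caller side and that a SparkSession `spark` is available:
   ```python
   from pyspark.sql.types import StructField, StructType
   
   sdf = spark.range(10)
   pdf = sdf.toPandas()
   
   # Rebuild the schema with every field marked nullable before round-tripping
   # through pandas, so the non-nullability check no longer trips.
   nullable_schema = StructType(
       [StructField(f.name, f.dataType, nullable=True) for f in sdf.schema.fields]
   )
   spark.createDataFrame(pdf, nullable_schema)
   ```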





[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127081206


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   This resolution was substantially different from what we do in normal writes or in data sources that actually support row-level operations. I am migrating to logic that is close to the by-name resolution in V2 tables.






[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127081206


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   This resolution was substantially different from what we do in normal writes or in data sources that actually support row-level operations. I am migrating to logic that is close to the by-name resolution in v2 tables.






[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127348254


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##
@@ -2057,6 +2057,17 @@ private[sql] object QueryCompilationErrors extends 
QueryErrorsBase {
 "errors" -> errors.mkString("\n- ")))
   }
 
+  def invalidRowLevelOperationAssignments(
+  assignments: Seq[Assignment],
+  errors: Seq[String]): Throwable = {
+
+new AnalysisException(
+  errorClass = "DATATYPE_MISMATCH.INVALID_ROW_LEVEL_OPERATION_ASSIGNMENTS",

Review Comment:
   I am using `DATATYPE_MISMATCH` as it seems appropriate.






[GitHub] [spark] aokolnychyi commented on pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on PR #40308:
URL: https://github.com/apache/spark/pull/40308#issuecomment-1457537193

   cc @huaxingao @cloud-fan @dongjoon-hyun @sunchao @viirya @gengliangwang 





[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127343402


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:
##
@@ -129,7 +129,7 @@ object TableOutputResolver {
 }
   }
 
-  private def checkNullability(
+  private[analysis] def checkNullability(

Review Comment:
   I want to reuse code from `TableOutputResolver` wherever possible. However, 
adding assignment processing directly there would make that class even more 
complicated than it is today. That's why I decided to open up some methods 
instead.
   
   Feedback is appreciated. I could add `applyUpdate` to `TableOutputResolver` 
too. No preference.






[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127343402


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:
##
@@ -129,7 +129,7 @@ object TableOutputResolver {
 }
   }
 
-  private def checkNullability(
+  private[analysis] def checkNullability(

Review Comment:
   I want to reuse code from `TableOutputResolver` wherever possible. However, 
adding assignment processing directly there would make that class even more 
complicated than it is today. That's why I decided to open up some methods 
instead.
   
   Feedback is appreciated. I could add `applyUpdate` to `TableOutputResolver` 
too.






[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127343402


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:
##
@@ -129,7 +129,7 @@ object TableOutputResolver {
 }
   }
 
-  private def checkNullability(
+  private[analysis] def checkNullability(

Review Comment:
   I want to reuse code from `TableOutputResolver` wherever possible. However, 
adding assignment processing directly there would make that class even more 
complicated than it is today. That's why I decided to open up some methods 
instead.
   
   Feedback is appreciated.






[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127343402


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:
##
@@ -129,7 +129,7 @@ object TableOutputResolver {
 }
   }
 
-  private def checkNullability(
+  private[analysis] def checkNullability(

Review Comment:
   I want to reuse code from `TableOutputResolver` wherever possible but adding 
assignment processing would make it even more complicated than it is today. 
That's why I decided to open up some methods instead.
   
   Let me know your thoughts.






[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127342306


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AssignmentUtils.scala:
##
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeReference, CreateNamedStruct, Expression, ExtractValue, 
GetStructField, Literal, NamedExpression}
+import org.apache.spark.sql.catalyst.plans.logical.Assignment
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}
+
+object AssignmentUtils extends SQLConfHelper with CastSupport {
+
+  private case class ColumnUpdate(ref: Seq[String], expr: Expression)
+
+  /**
+   * Aligns assignments to match table columns.
+   * 
+   * This method processes and reorders given assignments so that each target 
column gets
+   * an expression it should be set to. If a column does not have a matching 
assignment,
+   * it will be set to its current value. For example, if one passes table 
attributes c1, c2
+   * and an assignment c2 = 1, this method will return c1 = c1, c2 = 1.
+   * 
+   * This method also handles updates to nested columns. If there is an 
assignment to a particular
+   * nested field, this method will construct a new struct with one field 
updated preserving other
+   * fields that have not been modified. For example, if one passes table 
attributes c1, c2
+   * where c2 is a struct with fields n1 and n2 and an assignment c2.n2 = 1, 
this method will
+   * return c1 = c1, c2 = struct(c2.n1, 1).
+   *
+   * @param attrs table attributes
+   * @param assignments assignments to align
+   * @return aligned assignments that match table columns
+   */
+  def alignAssignments(
+  attrs: Seq[Attribute],
+  assignments: Seq[Assignment]): Seq[Assignment] = {
+
+val errors = new mutable.ArrayBuffer[String]()
+
+val output = applyUpdates(
+  updates = assignments.map(toColumnUpdate),
+  cols = attrs.map(restoreActualType),
+  colExprs = attrs,
+  addError = err => errors += err)
+
+if (errors.nonEmpty) {
+  throw 
QueryCompilationErrors.invalidRowLevelOperationAssignments(assignments, 
errors.toSeq)
+}
+
+attrs.zip(output).map { case (attr, expr) => Assignment(attr, expr) }
+  }
+
+  private def toColumnUpdate(assignment: Assignment): ColumnUpdate = {
+ColumnUpdate(toRef(assignment.key), assignment.value)
+  }
+
+  private def restoreActualType(attr: Attribute): Attribute = {
+
attr.withDataType(CharVarcharUtils.getRawType(attr.metadata).getOrElse(attr.dataType))
+  }
+
+  private def applyUpdates(
+  updates: Seq[ColumnUpdate],
+  cols: Seq[Attribute],
+  colExprs: Seq[Expression],
+  addError: String => Unit,
+  colPath: Seq[String] = Nil): Seq[Expression] = {
+
+// iterate through columns at the current level and find matching updates
+cols.zip(colExprs).map { case (col, colExpr) =>
+  // find matches for this column or any of its children
+  val prefixMatchedUpdates = updates.filter(update => 
conf.resolver(update.ref.head, col.name))
+  prefixMatchedUpdates match {
+// if there is no exact match and no match for children, return the 
column expr as is
+case matchedUpdates if matchedUpdates.isEmpty =>
+  colExpr
+
+// if there is only one update and it is an exact match, return the 
assigned expression
+case Seq(matchedUpdate) if isExactMatch(matchedUpdate, col) =>
+  applyUpdate(matchedUpdate.expr, col, addError, colPath :+ col.name)
+
+// if there are matches only for children
+case matchedUpdates if !hasExactMatch(matchedUpdates, col) =>
+  val newColPath = colPath 

[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127340319


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AssignmentUtils.scala:
##
@@ -0,0 +1,275 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.SQLConfHelper
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeReference, CreateNamedStruct, Expression, ExtractValue, 
GetStructField, Literal, NamedExpression}
+import org.apache.spark.sql.catalyst.plans.logical.Assignment
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}
+
+object AssignmentUtils extends SQLConfHelper with CastSupport {
+
+  private case class ColumnUpdate(ref: Seq[String], expr: Expression)
+
+  /**
+   * Aligns assignments to match table columns.
+   * 
+   * This method processes and reorders given assignments so that each target 
column gets
+   * an expression it should be set to. If a column does not have a matching 
assignment,
+   * it will be set to its current value. For example, if one passes table 
attributes c1, c2
+   * and an assignment c2 = 1, this method will return c1 = c1, c2 = 1.
+   * 
+   * This method also handles updates to nested columns. If there is an 
assignment to a particular
+   * nested field, this method will construct a new struct with one field 
updated preserving other
+   * fields that have not been modified. For example, if one passes table 
attributes c1, c2
+   * where c2 is a struct with fields n1 and n2 and an assignment c2.n2 = 1, 
this method will
+   * return c1 = c1, c2 = struct(c2.n1, 1).
+   *
+   * @param attrs table attributes
+   * @param assignments assignments to align
+   * @return aligned assignments that match table columns
+   */
+  def alignAssignments(

Review Comment:
   The logic in this method tries to follow the by-name resolution we have in V2 tables.
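   A simplified, purely illustrative model of the top-level alignment rule described in the doc comment (nested-field handling and casts omitted; nothing here is the actual implementation):
   ```python
   def align_assignments(table_cols, assignments):
       """Every table column gets an expression; unassigned columns keep their value."""
       return [(col, assignments.get(col, col)) for col in table_cols]
   
   # UPDATE t SET c2 = 1 against a table with columns c1, c2
   print(align_assignments(["c1", "c2"], {"c2": "1"}))
   # [('c1', 'c1'), ('c2', '1')]
   ```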






[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127079791


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlignRowLevelCommandAssignments.scala:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UpdateTable}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy
+
+/**
+ * A rule that aligns assignments in row-level operations.
+ *
+ * Note that this rule must be run after resolving default values but before 
rewriting row-level

Review Comment:
   We need to think about a reliable way to check if default values have been 
resolved. Right now, it simply relies on the order of rules, which is fragile. 
Ideas are welcome.






[GitHub] [spark] HeartSaVioR closed pull request #39931: [SPARK-42376][SS] Introduce watermark propagation among operators

2023-03-06 Thread via GitHub


HeartSaVioR closed pull request #39931: [SPARK-42376][SS] Introduce watermark 
propagation among operators
URL: https://github.com/apache/spark/pull/39931





[GitHub] [spark] HeartSaVioR commented on pull request #39931: [SPARK-42376][SS] Introduce watermark propagation among operators

2023-03-06 Thread via GitHub


HeartSaVioR commented on PR #39931:
URL: https://github.com/apache/spark/pull/39931#issuecomment-1457521207

   Merging to master.





[GitHub] [spark] HeartSaVioR commented on pull request #39931: [SPARK-42376][SS] Introduce watermark propagation among operators

2023-03-06 Thread via GitHub


HeartSaVioR commented on PR #39931:
URL: https://github.com/apache/spark/pull/39931#issuecomment-1457520455

   Thanks all for the huge effort on reviewing this complicated change! The 
implementation got better with the review comments.





[GitHub] [spark] zsxwing commented on a diff in pull request #39931: [SPARK-42376][SS] Introduce watermark propagation among operators

2023-03-06 Thread via GitHub


zsxwing commented on code in PR #39931:
URL: https://github.com/apache/spark/pull/39931#discussion_r1127324257


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/WatermarkPropagator.scala:
##
@@ -0,0 +1,322 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming
+
+import java.{util => jutil}
+
+import scala.collection.mutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.Utils
+
+/**
+ * Interface for propagating watermark. The implementation is not required to 
be thread-safe,
+ * as all methods are expected to be called from the query execution thread.
+ * (The guarantee may change on further improvements on Structured Streaming - 
update
+ * implementations if we change the guarantee.)
+ */
+sealed trait WatermarkPropagator {
+  /**
+   * Request to propagate watermark among operators based on origin watermark 
value. The result
+   * should be input watermark per stateful operator, which Spark will request 
the value by calling
+   * getInputWatermarkXXX with operator ID.
+   *
+   * It is recommended for implementation to cache the result, as Spark can 
request the propagation
+   * multiple times with the same batch ID and origin watermark value.
+   */
+  def propagate(batchId: Long, plan: SparkPlan, originWatermark: Long): Unit
+
+  /** Provide the calculated input watermark for late events for given 
stateful operator. */
+  def getInputWatermarkForLateEvents(batchId: Long, stateOpId: Long): Long
+
+  /** Provide the calculated input watermark for eviction for given stateful 
operator. */
+  def getInputWatermarkForEviction(batchId: Long, stateOpId: Long): Long
+
+  /**
+   * Request to clean up the cached result of propagation. Spark will call this method when the given
+   * batch ID is unlikely to be re-executed.
+   */
+  def purge(batchId: Long): Unit
+}
+
+/**
+ * Do nothing. This is a dummy implementation to help create a dummy IncrementalExecution instance.
+ */
+object NoOpWatermarkPropagator extends WatermarkPropagator {
+  def propagate(batchId: Long, plan: SparkPlan, originWatermark: Long): Unit = 
{}
+  def getInputWatermarkForLateEvents(batchId: Long, stateOpId: Long): Long = 
Long.MinValue
+  def getInputWatermarkForEviction(batchId: Long, stateOpId: Long): Long = 
Long.MinValue
+  def purge(batchId: Long): Unit = {}
+}
+
+/**
+ * This implementation uses a single global watermark for late events and 
eviction.
+ *
+ * This implementation provides the behavior before Structured Streaming 
supports multiple stateful
+ * operators. (prior to SPARK-40925) This is only used for compatibility mode.
+ */
+class UseSingleWatermarkPropagator extends WatermarkPropagator {
+  // We use a TreeMap to sort the keys (batch IDs) and evict old batch IDs efficiently.
+  private val batchIdToWatermark: jutil.TreeMap[Long, Long] = new 
jutil.TreeMap[Long, Long]()
+
+  private def isInitialized(batchId: Long): Boolean = 
batchIdToWatermark.containsKey(batchId)
+
+  override def propagate(batchId: Long, plan: SparkPlan, originWatermark: 
Long): Unit = {
+if (batchId < 0) {
+  // no-op
+} else if (isInitialized(batchId)) {
+  val cached = batchIdToWatermark.get(batchId)
+  assert(cached == originWatermark,
+s"Watermark has been changed for the same batch ID! Batch ID: 
$batchId, " +
+  s"Value in cache: $cached, value given: $originWatermark")
+} else {
+  batchIdToWatermark.put(batchId, originWatermark)
+}
+  }
+
+  private def getInputWatermark(batchId: Long, stateOpId: Long): Long = {
+if (batchId < 0) {
+  0
+} else {
+  assert(isInitialized(batchId), s"Watermark for batch ID $batchId is not 
yet set!")
+  batchIdToWatermark.get(batchId)
+}
+  }
+
+  def getInputWatermarkForLateEvents(batchId: Long, stateOpId: Long): Long =
+getInputWatermark(batchId, stateOpId)
+
+  def getInputWatermarkForEviction(batchId: Long, stateOpId: Long): Long =
+getInputWatermark(batchId, stateOpId)
+
+  override def purge(batchId: Long): 
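
   As a standalone illustration of the caching contract described in the
   `WatermarkPropagator` scaladoc above: a simplified sketch, not the PR's code.
   The real implementation also walks the physical plan to compute per-operator
   input watermarks; that part is omitted here.

   ```scala
   import java.{util => jutil}

   // Sketch of the per-batch caching pattern: propagate() is idempotent for a
   // given batch ID, getInputWatermark() reads the cached value, and purge()
   // evicts batches that will not be re-executed.
   class WatermarkCacheSketch {
     private val batchIdToWatermark = new jutil.TreeMap[Long, Long]()

     def propagate(batchId: Long, originWatermark: Long): Unit = {
       if (batchId < 0) {
         // no-op, mirrors the guard in the quoted code
       } else if (batchIdToWatermark.containsKey(batchId)) {
         val cached = batchIdToWatermark.get(batchId)
         assert(cached == originWatermark,
           s"Watermark changed for batch $batchId: cached $cached, given $originWatermark")
       } else {
         batchIdToWatermark.put(batchId, originWatermark)
       }
     }

     def getInputWatermark(batchId: Long): Long = {
       if (batchId < 0) {
         0L
       } else {
         assert(batchIdToWatermark.containsKey(batchId),
           s"Watermark for batch $batchId is not yet set")
         batchIdToWatermark.get(batchId)
       }
     }

     // Evict every cached entry with a batch ID <= the given one.
     def purge(batchId: Long): Unit =
       batchIdToWatermark.headMap(batchId, true).clear()
   }
   ```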

[GitHub] [spark] cloud-fan commented on pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-06 Thread via GitHub


cloud-fan commented on PR #40300:
URL: https://github.com/apache/spark/pull/40300#issuecomment-1457500143

   It's a good idea to provide an API that allows people to unambiguously 
reference metadata columns, and I like the new `Dataset.metadataColumn` 
function. However, I think the prepending underscore approach is a bit hacky. 
It's too implicit and I'd prefer a more explicit syntax like `SELECT 
metadata(_metadata) FROM t`. We can discuss this more and invite more SQL 
experts. Shall we exclude it from this PR for now?





[GitHub] [spark] cloud-fan commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-06 Thread via GitHub


cloud-fan commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1127321842


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##
@@ -42,6 +42,24 @@ abstract class LogicalPlan
*/
   def metadataOutput: Seq[Attribute] = children.flatMap(_.metadataOutput)
 
+  /**
+   * Finds a metadata attribute of this node by its logical name. This search 
will work even if the
+   * metadata attribute was renamed because of a conflicting name in the data 
schema.
+   */
+  def getMetadataAttributeByName(name: String): Attribute = {
+// NOTE: An already-referenced column might appear in `output` instead of 
`metadataOutput`.

Review Comment:
   A metadata col may appear in `output`, but it always appears in `metadataOutput`. 
I think `outputMetadataAttributes.resolve(nameParts, resolver)` should do the 
work.






[GitHub] [spark] beliefer commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-06 Thread via GitHub


beliefer commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1127311861


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -140,6 +140,9 @@ message Read {
 
 // (Optional) A list of path for file-system backed data sources.
 repeated string paths = 4;
+
+// (Optional) Condition in the where clause for each partition.

Review Comment:
   OK






[GitHub] [spark] cloud-fan commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-06 Thread via GitHub


cloud-fan commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1127307622


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2714,6 +2726,17 @@ class Dataset[T] private[sql](
*/
   def withColumn(colName: String, col: Column): DataFrame = 
withColumns(Seq(colName), Seq(col))
 
+  /**
+   * Returns a new Dataset by selecting a metadata column with the given 
logical name.
+   *
+   * A metadata column can be accessed this way even if the underlying data 
source defines a data
+   * column with a conflicting name.
+   *
+   * @group untypedrel
+   * @since 4.0.0
+   */
+  def withMetadataColumn(colName: String): DataFrame = withColumn(colName, 
metadataColumn(colName))

Review Comment:
   This is a weird API as there is no `withColumn` function that takes only a 
string parameter.






[GitHub] [spark] cloud-fan commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-06 Thread via GitHub


cloud-fan commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1127307264


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2714,6 +2726,17 @@ class Dataset[T] private[sql](
*/
   def withColumn(colName: String, col: Column): DataFrame = 
withColumns(Seq(colName), Seq(col))
 
+  /**
+   * Returns a new Dataset by selecting a metadata column with the given 
logical name.
+   *
+   * A metadata column can be accessed this way even if the underlying data 
source defines a data
+   * column with a conflicting name.
+   *
+   * @group untypedrel
+   * @since 4.0.0
+   */
+  def withMetadataColumn(colName: String): DataFrame = withColumn(colName, 
metadataColumn(colName))

Review Comment:
   people can just do `df.withColumn(name, df.metadataColumn(col))`, right?
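
   A short sketch of that usage. Note that `Dataset.metadataColumn` is the method
   proposed in this PR and is not part of a released Spark version at the time of
   this thread; the path and column names below are placeholders, and a
   SparkSession named `spark` is assumed.

   ```scala
   // Load a file-based source whose data schema may itself contain a column
   // literally named "_metadata".
   val df = spark.read.parquet("/tmp/events")

   // Reference the file-source metadata struct unambiguously via the proposed
   // API, then add one of its fields as a regular column.
   val withPath = df.withColumn(
     "source_file",
     df.metadataColumn("_metadata").getField("file_path"))
   ```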






[GitHub] [spark] LuciferYang commented on pull request #40283: [SPARK-42673][BUILD] Make `build/mvn` build Spark only with the verified maven version

2023-03-06 Thread via GitHub


LuciferYang commented on PR #40283:
URL: https://github.com/apache/spark/pull/40283#issuecomment-1457450172

   Thanks @dongjoon-hyun @pan3793 ~
   Also thanks @gnodet @hboutemy 





[GitHub] [spark] hvanhovell commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names

2023-03-06 Thread via GitHub


hvanhovell commented on PR #40287:
URL: https://github.com/apache/spark/pull/40287#issuecomment-1457433571

   @beliefer here is the thing. When this was designed it was mainly aimed at 
SQL, and there we definitely do not generate unique names in lambda functions 
either. This is all done in the analyzer. We should be able to follow the same 
path.
   
   Do you happen to know whether the tests failing for Python also fail for Scala?
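
   As a minimal illustration of the SQL path referenced above: a lambda variable
   in a SQL higher-order function is written with a plain name and resolved by
   the analyzer; no unique suffix is generated at parse time. This assumes a
   SparkSession named `spark`.

   ```scala
   // transform() is a built-in higher-order function; `x` is the lambda variable.
   spark.sql("SELECT transform(array(1, 2, 3), x -> x + 1) AS plus_one").show()
   ```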





[GitHub] [spark] beliefer commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-06 Thread via GitHub


beliefer commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1127291008


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -250,6 +250,46 @@ class DataFrameReader private[sql] (sparkSession: 
SparkSession) extends Logging
 jdbc(url, table, connectionProperties)
   }
 
+  /**
+   * Construct a `DataFrame` representing the database table accessible via 
JDBC URL url named
+   * table using connection properties. The `predicates` parameter gives a list of expressions
+   * suitable for inclusion in WHERE clauses; each one defines one partition 
of the `DataFrame`.
+   *
+   * Don't create too many partitions in parallel on a large cluster; 
otherwise Spark might crash
+   * your external database systems.
+   *
+   * You can find the JDBC-specific option and parameter documentation for 
reading tables via JDBC
+   * in https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option;>
+   * Data Source Option in the version you use.
+   *
+   * @param table
+   *   Name of the table in the external database.
+   * @param predicates
+   *   Condition in the where clause for each partition.
+   * @param connectionProperties
+   *   JDBC database connection arguments, a list of arbitrary string 
tag/value. Normally at least
+   *   a "user" and "password" property should be included. "fetchsize" can be 
used to control the
+   *   number of rows per fetch.
+   * @since 3.4.0
+   */
+  def jdbc(
+  url: String,
+  table: String,
+  predicates: Array[String],
+  connectionProperties: Properties): DataFrame = {
+sparkSession.newDataFrame { builder =>

Review Comment:
   Yeah, we can't rely on the client.
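
   For context, an illustrative call to the overload documented above, using the
   existing (non-Connect) `DataFrameReader.jdbc` API that the Connect client is
   mirroring. The JDBC URL, table name and credentials are placeholders, and a
   SparkSession named `spark` is assumed.

   ```scala
   import java.util.Properties

   val props = new Properties()
   props.setProperty("user", "spark")       // placeholder credentials
   props.setProperty("password", "secret")

   // Each predicate defines one partition of the resulting DataFrame.
   val predicates = Array("region = 'EU'", "region = 'US'", "region = 'APAC'")

   val orders = spark.read.jdbc(
     "jdbc:postgresql://db-host:5432/sales",  // placeholder URL
     "public.orders",                         // placeholder table
     predicates,
     props)
   ```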






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-06 Thread via GitHub


WeichenXu123 commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1127288752


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/AlgorithmRegisty.scala:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.ml
+
+import org.apache.spark.connect.proto
+import org.apache.spark.ml
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.classification.TrainingSummary
+import org.apache.spark.sql.DataFrame
+
+
+object AlgorithmRegistry {
+
+  def get(name: String): Algorithm = {
+name match {
+  case "LogisticRegression" => new LogisticRegressionAlgorithm
+  case _ =>
+throw new IllegalArgumentException()
+}
+  }
+
+}
+
+
+abstract class Algorithm {
+
+  def initiateEstimator(uid: String): Estimator[_]
+
+  def getModelAttr(model: Model[_], name: String): 
Either[proto.MlCommandResponse, DataFrame]

Review Comment:
   The `DataFrame` case cannot be put in `MlCommandResponse`, because for the 
`DataFrame` case we directly generate the dataframe plan on the client side; we 
don't send a request to the server.






[GitHub] [spark] beliefer commented on pull request #40296: [SPARK-42680][CONNECT][TESTS] Create the helper function withSQLConf for connect test framework

2023-03-06 Thread via GitHub


beliefer commented on PR #40296:
URL: https://github.com/apache/spark/pull/40296#issuecomment-1457422020

   @HyukjinKwon @zhengruifeng Thank you.





[GitHub] [spark] beliefer commented on pull request #40287: [SPARK-42562][CONNECT] UnresolvedNamedLambdaVariable in python do not need unique names

2023-03-06 Thread via GitHub


beliefer commented on PR #40287:
URL: https://github.com/apache/spark/pull/40287#issuecomment-1457418420

   @hvanhovell Scala also uses 
`UnresolvedNamedLambdaVariable.freshVarName("x")` to get the unique names. see: 
   
https://github.com/apache/spark/blob/201e08c03a31c763e3120540ac1b1ca8ef252e6b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4096





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-06 Thread via GitHub


zhengruifeng commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1127232443


##
connector/connect/common/src/main/protobuf/spark/connect/ml.proto:
##
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+syntax = 'proto3';
+
+package spark.connect;
+
+import "google/protobuf/any.proto";
+import "spark/connect/expressions.proto";
+import "spark/connect/types.proto";
+import "spark/connect/relations.proto";
+import "spark/connect/ml_common.proto";
+
+option java_multiple_files = true;
+option java_package = "org.apache.spark.connect.proto";
+
+
+message Evaluator {
+  string name = 1;
+  Params params = 2;
+  string uid = 3;
+}
+
+
+message MlCommand {
+  oneof ml_command_type {
+Fit fit = 1;
+ModelAttr model_attr = 2;
+ModelSummaryAttr model_summary_attr = 3;
+LoadModel load_model = 4;
+SaveModel save_model = 5;
+Evaluate evaluate = 6;
+  }
+
+  message Fit {
+Stage estimator = 1;
+Relation dataset = 2;
+  }
+
+  message Evaluate {
+Evaluator evaluator = 1;
+  }
+
+  message LoadModel {
+string name = 1;
+string path = 2;
+  }
+
+  message SaveModel {
+int64 model_ref_id = 1;
+string path = 2; // saving path
+bool overwrite = 3;
+map options = 4; // saving options
+  }
+
+  message ModelAttr {
+int64 model_ref_id = 1;
+string name = 2;
+  }
+
+  message ModelSummaryAttr {
+int64 model_ref_id = 1;
+string name = 2;
+Params params = 3;
+
+// Evaluation dataset that it uses to computes
+// the summary attribute
+// If not set, get attributes from
+// model.summary (i.e. the summary on training dataset)
+optional Relation evaluation_dataset = 4;
+  }
+}
+
+
+message MlCommandResponse {
+  oneof ml_command_response_type {
+Expression.Literal literal = 1;
+ModelInfo model_info = 2;
+Vector vector = 3;
+Matrix matrix = 4;
+  }
+  message ModelInfo {
+int64 model_ref_id = 1;
+string model_uid = 2;
+  }
+}
+
+
+message Vector {
+  oneof one_of {
+Dense dense = 1;
+Sparse sparse = 2;
+  }
+  message Dense {
+repeated double values = 1;
+  }
+  message Sparse {
+int32 size = 1;
+repeated double indices = 2;
+repeated double values = 3;
+  }
+}
+
+message Matrix {
+  oneof one_of {
+Dense dense = 1;
+Sparse sparse = 2;
+  }
+  message Dense {
+int32 num_rows = 1;
+int32 num_cols = 2;
+repeated double values = 3;

Review Comment:
   nit, `DenseMatrix` also has `isTransposed`
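
   A quick illustration of this nit: `DenseMatrix` in Spark ML carries an
   `isTransposed` flag in addition to numRows/numCols/values, so a proto message
   mirroring it would need a fourth field. The values below are arbitrary.

   ```scala
   import org.apache.spark.ml.linalg.DenseMatrix

   // A 2x3 matrix whose values are given in row-major order, hence isTransposed = true.
   val m = new DenseMatrix(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0), isTransposed = true)
   println(m.isTransposed)  // true
   ```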



##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/AlgorithmRegisty.scala:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.ml
+
+import org.apache.spark.connect.proto
+import org.apache.spark.ml
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.classification.TrainingSummary
+import org.apache.spark.sql.DataFrame
+
+
+object AlgorithmRegistry {
+
+  def get(name: String): Algorithm = {
+name match {
+  case "LogisticRegression" => new LogisticRegressionAlgorithm
+  case _ =>
+throw new IllegalArgumentException()
+}
+  }
+
+}
+
+
+abstract class Algorithm {
+
+  def initiateEstimator(uid: String): Estimator[_]
+
+  def getModelAttr(model: Model[_], name: String): 
Either[proto.MlCommandResponse, DataFrame]

Review Comment:
   why not making 

[GitHub] [spark] HyukjinKwon commented on pull request #40244: [SPARK-42643][CONNECT][PYTHON] Register Java (aggregate) user-defined functions

2023-03-06 Thread via GitHub


HyukjinKwon commented on PR #40244:
URL: https://github.com/apache/spark/pull/40244#issuecomment-1457397715

   WDYT @hvanhovell ?





[GitHub] [spark] HyukjinKwon closed pull request #40296: [SPARK-42680][CONNECT][TESTS] Create the helper function withSQLConf for connect test framework

2023-03-06 Thread via GitHub


HyukjinKwon closed pull request #40296: [SPARK-42680][CONNECT][TESTS] Create 
the helper function withSQLConf for connect test framework
URL: https://github.com/apache/spark/pull/40296





[GitHub] [spark] HyukjinKwon commented on pull request #40296: [SPARK-42680][CONNECT][TESTS] Create the helper function withSQLConf for connect test framework

2023-03-06 Thread via GitHub


HyukjinKwon commented on PR #40296:
URL: https://github.com/apache/spark/pull/40296#issuecomment-1457393807

   Merged to master and branch-3.4.





[GitHub] [spark] hvanhovell closed pull request #40309: [SPARK-42688][CONNECT] Rename Connect proto Request client_id to session_id

2023-03-06 Thread via GitHub


hvanhovell closed pull request #40309: [SPARK-42688][CONNECT] Rename Connect 
proto Request client_id to session_id
URL: https://github.com/apache/spark/pull/40309





[GitHub] [spark] hvanhovell commented on pull request #40309: [SPARK-42688][CONNECT] Rename Connect proto Request client_id to session_id

2023-03-06 Thread via GitHub


hvanhovell commented on PR #40309:
URL: https://github.com/apache/spark/pull/40309#issuecomment-1457390771

   Merging.





[GitHub] [spark] vitaliili-db commented on pull request #40295: [SPARK-42681] Relax ordering constraint for ALTER TABLE ADD|REPLACE column options

2023-03-06 Thread via GitHub


vitaliili-db commented on PR #40295:
URL: https://github.com/apache/spark/pull/40295#issuecomment-1457383616

   build timed out but succeeded on rerun: 
https://github.com/vitaliili-db/spark/actions/runs/4346311324/jobs/7598960402





[GitHub] [spark] vitaliili-db commented on pull request #40295: [SPARK-42681] Relax ordering constraint for ALTER TABLE ADD|REPLACE column options

2023-03-06 Thread via GitHub


vitaliili-db commented on PR #40295:
URL: https://github.com/apache/spark/pull/40295#issuecomment-1457384015

   @gengliangwang can you review this please?





[GitHub] [spark] hvanhovell closed pull request #40303: [SPARK-42656][CONNECT][Followup] Improve the script to start spark-connect server

2023-03-06 Thread via GitHub


hvanhovell closed pull request #40303: [SPARK-42656][CONNECT][Followup] Improve 
the script to start spark-connect server
URL: https://github.com/apache/spark/pull/40303





[GitHub] [spark] hvanhovell commented on pull request #40303: [SPARK-42656][CONNECT][Followup] Improve the script to start spark-connect server

2023-03-06 Thread via GitHub


hvanhovell commented on PR #40303:
URL: https://github.com/apache/spark/pull/40303#issuecomment-1457382781

   Merging





[GitHub] [spark] LuciferYang commented on pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType

2023-03-06 Thread via GitHub


LuciferYang commented on PR #40218:
URL: https://github.com/apache/spark/pull/40218#issuecomment-1457373651

   Thanks @hvanhovell 





[GitHub] [spark] panbingkun commented on pull request #40280: [SPARK-42671][CONNECT] Fix bug for createDataFrame from complex type schema

2023-03-06 Thread via GitHub


panbingkun commented on PR #40280:
URL: https://github.com/apache/spark/pull/40280#issuecomment-1457349284

   > Thanks @panbingkun for the nice fix! Btw, I think I found another 
`createDataFrame` bug: it does not work properly with a non-nullable schema, as 
below:
   > 
   > ```python
   > >>> from pyspark.sql.types import *
   > >>> schema_false = StructType([StructField("id", IntegerType(), False)])
   > >>> spark.createDataFrame([[1]], schema=schema_false)
   > Traceback (most recent call last):
   > ...
   > pyspark.errors.exceptions.connect.AnalysisException: 
[NULLABLE_COLUMN_OR_FIELD] Column or field `id` is nullable while it's required 
to be non-nullable.
   > ```
   > 
   > whereas it works fine with a nullable schema, as below:
   > 
   > ```python
   > >>> schema_true = StructType([StructField("id", IntegerType(), True)])
   > >>> spark.createDataFrame([[1]], schema=schema_true)
   > DataFrame[id: int]
   > ```
   > 
   > Do you have any idea what might be causing this? Could you take a look at 
it if you're interested in? I have filed an issue at 
[SPARK-42679](https://issues.apache.org/jira/browse/SPARK-42679).
   > 
   > Also cc @hvanhovell as an original author for `createDataFrame`.
   
   Let me try to investigate it.





[GitHub] [spark] mridulm commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


mridulm commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1457315803

   The test failure is unrelated, so the existing tests work fine; I will work on 
tests that specifically cover the changes in this PR later today.





[GitHub] [spark] zhengruifeng commented on pull request #40296: [SPARK-42680][CONNECT][TESTS] Create the helper function withSQLConf for connect test framework

2023-03-06 Thread via GitHub


zhengruifeng commented on PR #40296:
URL: https://github.com/apache/spark/pull/40296#issuecomment-1457271633

   @beliefer I think it's not a `new feature` as mentioned in the PR description





[GitHub] [spark] github-actions[bot] closed pull request #38736: [SPARK-41214][SQL] - SQL Metrics are missing from Spark UI when AQE for Cached DataFrame is enabled

2023-03-06 Thread via GitHub


github-actions[bot] closed pull request #38736: [SPARK-41214][SQL] - SQL 
Metrics are missing from Spark UI when AQE for Cached DataFrame is enabled
URL: https://github.com/apache/spark/pull/38736





[GitHub] [spark] ueshin opened a new pull request, #40310: [SPARK-42022][CONNECT][PYTHON] Fix createDataFrame to autogenerate missing column names

2023-03-06 Thread via GitHub


ueshin opened a new pull request, #40310:
URL: https://github.com/apache/spark/pull/40310

   ### What changes were proposed in this pull request?
   
   Fixes `createDataFrame` to autogenerate missing column names.
   
   ### Why are the changes needed?
   
   Currently, when the number of column names specified to `createDataFrame` does 
not match the actual number of columns, it raises an error:
   
   ```py
   >>> spark.createDataFrame([["a", "b"]], ["col1"])
   Traceback (most recent call last):
   ...
   ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 
elements
   ```
   
   but it should auto-generate the missing column names.
   
   ### Does this PR introduce _any_ user-facing change?
   
   It will auto-generate the missing columns:
   
   ```py
   >>> spark.createDataFrame([["a", "b"]], ["col1"])
   DataFrame[col1: string, _2: string]
   ```
   
   ### How was this patch tested?
   
   Enabled the related test.





[GitHub] [spark] wangyum commented on a diff in pull request #40268: [SPARK-42500][SQL] ConstantPropagation support more cases

2023-03-06 Thread via GitHub


wangyum commented on code in PR #40268:
URL: https://github.com/apache/spark/pull/40268#discussion_r1127193046


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:
##
@@ -138,56 +136,53 @@ object ConstantPropagation extends Rule[LogicalPlan] {
*case of `WHERE e`, null result of expression `e` means 
the same as if it
*resulted false
* @return A tuple including:
-   * 1. Option[Expression]: optional changed condition after traversal
+   * 1. Expression: optional changed condition after traversal

Review Comment:
   Remove the `optional`?






[GitHub] [spark] hvanhovell closed pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType

2023-03-06 Thread via GitHub


hvanhovell closed pull request #40218: [SPARK-42579][CONNECT] Part-1: 
`function.lit` support `Array[_]` dataType
URL: https://github.com/apache/spark/pull/40218





[GitHub] [spark] hvanhovell commented on pull request #40218: [SPARK-42579][CONNECT] Part-1: `function.lit` support `Array[_]` dataType

2023-03-06 Thread via GitHub


hvanhovell commented on PR #40218:
URL: https://github.com/apache/spark/pull/40218#issuecomment-1457206111

   Merging.





[GitHub] [spark] HeartSaVioR closed pull request #40306: [SPARK-42687][SS] Better error message for the unsupport `pivot` operation in Streaming

2023-03-06 Thread via GitHub


HeartSaVioR closed pull request #40306: [SPARK-42687][SS] Better error message 
for the unsupport `pivot` operation in Streaming
URL: https://github.com/apache/spark/pull/40306





[GitHub] [spark] HeartSaVioR commented on pull request #40306: [SPARK-42687][SS] Better error message for the unsupport `pivot` operation in Streaming

2023-03-06 Thread via GitHub


HeartSaVioR commented on PR #40306:
URL: https://github.com/apache/spark/pull/40306#issuecomment-1457192756

   Thanks, merging to master!





[GitHub] [spark] zhenlineo commented on pull request #40305: [SPARK-42656][CONNECT][Followup] Spark Connect Shell

2023-03-06 Thread via GitHub


zhenlineo commented on PR #40305:
URL: https://github.com/apache/spark/pull/40305#issuecomment-1457166376

   If this PR is accepted then there is no need to merge 
https://github.com/apache/spark/pull/40303, as this PR overrides the changes 
needed there.





[GitHub] [spark] zhenlineo commented on pull request #40303: [SPARK-42656][CONNECT][Followup] Improve the script to start spark-connect server

2023-03-06 Thread via GitHub


zhenlineo commented on PR #40303:
URL: https://github.com/apache/spark/pull/40303#issuecomment-1457165581

   Or even better? -> https://github.com/apache/spark/pull/40305





[GitHub] [spark] amaliujia commented on pull request #40309: [SPARK-42688][CONNECT] Rename Connect proto Request client_id to session_id

2023-03-06 Thread via GitHub


amaliujia commented on PR #40309:
URL: https://github.com/apache/spark/pull/40309#issuecomment-1457104885

   cc @zhengruifeng @HyukjinKwon @grundprinzip 





[GitHub] [spark] amaliujia opened a new pull request, #40309: [SPARK-42688][CONNECT] Rename Connect proto Request client_id to session_id

2023-03-06 Thread via GitHub


amaliujia opened a new pull request, #40309:
URL: https://github.com/apache/spark/pull/40309

   
   
   ### What changes were proposed in this pull request?
   
   Rename Connect proto Request client_id to session_id.
   
   On the one hand, when I read client_id I was confused about what it is used for, 
even after reading the proto documentation.
   
   On the other hand, the client sides already use session_id:
   
https://github.com/apache/spark/blob/9bf174f9722e34f13bfaede5e59f989bf2a511e9/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala#L51
   
https://github.com/apache/spark/blob/9bf174f9722e34f13bfaede5e59f989bf2a511e9/python/pyspark/sql/connect/client.py#L522
   
   
   ### Why are the changes needed?
   
   Code readability
   
   ### Does this PR introduce _any_ user-facing change?
   
   NO
   
   ### How was this patch tested?
   
   Existing UT





[GitHub] [spark] dongjoon-hyun closed pull request #40289: [SPARK-42478][SQL][3.2] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

2023-03-06 Thread via GitHub


dongjoon-hyun closed pull request #40289: [SPARK-42478][SQL][3.2] Make a 
serializable jobTrackerId instead of a non-serializable JobID in 
FileWriterFactory
URL: https://github.com/apache/spark/pull/40289





[GitHub] [spark] dongjoon-hyun closed pull request #40290: [SPARK-42478][SQL][3.3] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

2023-03-06 Thread via GitHub


dongjoon-hyun closed pull request #40290: [SPARK-42478][SQL][3.3] Make a 
serializable jobTrackerId instead of a non-serializable JobID in 
FileWriterFactory
URL: https://github.com/apache/spark/pull/40290





[GitHub] [spark] dongjoon-hyun commented on pull request #40290: [SPARK-42478][SQL][3.3] Make a serializable jobTrackerId instead of a non-serializable JobID in FileWriterFactory

2023-03-06 Thread via GitHub


dongjoon-hyun commented on PR #40290:
URL: https://github.com/apache/spark/pull/40290#issuecomment-1457080979

   Merged to branch-3.3. Thank you, @Yikf and @cloud-fan .





[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127081206


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   This resolution was substantially different compared to what we do in normal 
writes or in data sources that actually support row-level operations.






[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127080574


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlignRowLevelCommandAssignments.scala:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UpdateTable}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy
+
+/**
+ * A rule that aligns assignments in row-level operations.
+ *
+ * Note that this rule must be run after resolving default values but before 
rewriting row-level
+ * commands into executable plans. This rule does not apply to tables that 
accept any schema.
+ * Such tables must inject their own rules to align assignments.
+ */
+object AlignRowLevelCommandAssignments extends Rule[LogicalPlan] {
+
+  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+case u: UpdateTable if u.resolved && !u.aligned && shouldAlign(u.table) =>
+  val newTable = u.table.transform {
+case r: DataSourceV2Relation =>
+  validateStoreAssignmentPolicy()

Review Comment:
   I follow what we do for V2 inserts.






[GitHub] [spark] aokolnychyi commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1127079791


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlignRowLevelCommandAssignments.scala:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UpdateTable}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy
+
+/**
+ * A rule that aligns assignments in row-level operations.
+ *
+ * Note that this rule must be run after resolving default values but before 
rewriting row-level

Review Comment:
   We may need to think about a reliable way to check if default values have 
been resolved. Right now, it simply relies on the order of rules, which is 
fragile.



##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlignRowLevelCommandAssignments.scala:
##
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UpdateTable}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.CharVarcharUtils
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.internal.SQLConf.StoreAssignmentPolicy
+
+/**
+ * A rule that aligns assignments in row-level operations.
+ *
+ * Note that this rule must be run after resolving default values but before 
rewriting row-level

Review Comment:
   We need to think about a reliable way to check if default values have been 
resolved. Right now, it simply relies on the order of rules, which is fragile.
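
   One possible direction, sketched only to make the concern concrete (this 
helper is hypothetical and not part of the PR): derive "defaults resolved" from 
the plan itself instead of from rule ordering, e.g. by checking that no 
assignment still references the unresolved DEFAULT column.

   ```scala
   import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
   import org.apache.spark.sql.catalyst.plans.logical.UpdateTable

   // Hypothetical guard: true only when no assignment value still carries the
   // unresolved DEFAULT reference produced by the parser.
   def defaultsResolved(u: UpdateTable): Boolean =
     !u.assignments.exists(_.value.exists {
       case a: UnresolvedAttribute => a.name.equalsIgnoreCase("DEFAULT")
       case _ => false
     })
   ```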



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


mridulm commented on code in PR #40307:
URL: https://github.com/apache/spark/pull/40307#discussion_r1127076610


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -203,7 +205,8 @@ private[spark] class ExecutorAllocationManager(
   throw new SparkException(
 s"s${DYN_ALLOCATION_SUSTAINED_SCHEDULER_BACKLOG_TIMEOUT.key} must be > 
0!")
 }
-if (!conf.get(config.SHUFFLE_SERVICE_ENABLED)) {
+if (!conf.get(config.SHUFFLE_SERVICE_ENABLED) &&
+  !shuffleDriverComponents.supportsReliableStorage()) {

Review Comment:
   `SHUFFLE_SERVICE_ENABLED` will continue to exist for backward compatibility, 
but IMO it would be specific to the default implementation from SPARK-25299.
   
   That is better handled as part of a follow-up to SPARK-25299, IMO.
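
   For context, a minimal sketch of how a disaggregated-shuffle plugin might opt 
in, assuming the draft adds a `supportsReliableStorage()` method (defaulting to 
`false`) to `ShuffleDriverComponents` as the quoted diff suggests; the class 
below is hypothetical:

   ```scala
   import java.util.{Collections, Map => JMap}

   import org.apache.spark.shuffle.api.ShuffleDriverComponents

   // Hypothetical plugin-side implementation; compiles only against the API
   // shape proposed in this draft PR.
   class RemoteStoreDriverComponents extends ShuffleDriverComponents {

     override def initializeApplication(): JMap[String, String] =
       Collections.emptyMap() // no extra executor-side config in this sketch

     override def cleanupApplication(): Unit = {
       // would remove the application's shuffle data from the remote store
     }

     // Shuffle blocks live in external storage and survive executor loss, so
     // dynamic allocation no longer has to require the external shuffle service.
     override def supportsReliableStorage(): Boolean = true
   }
   ```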



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aokolnychyi opened a new pull request, #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-06 Thread via GitHub


aokolnychyi opened a new pull request, #40308:
URL: https://github.com/apache/spark/pull/40308

   
   
   ### What changes were proposed in this pull request?
   
   
   This PR adds a rule to align UPDATE assignments with table attributes.
   
   ### Why are the changes needed?
   
   
   These changes are needed so that we can rewrite UPDATE statements into 
executable plans for tables that support row-level operations.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No.
   
   ### How was this patch tested?
   
   
   This PR comes with tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #40304: [SPARK-42665][CONNECT][Test] Mute Scala Client UDF test

2023-03-06 Thread via GitHub


amaliujia commented on code in PR #40304:
URL: https://github.com/apache/spark/pull/40304#discussion_r1127070223


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala:
##
@@ -76,7 +76,8 @@ class ClientE2ETestSuite extends RemoteSparkSession {
 assert(result(2) == 2)
   }
 
-  test("simple udf") {
+  // TODO (SPARK-42665): Ignore this test until the udf is fully implemented.
+  ignore("simple udf") {

Review Comment:
   nit: I think the common way to do so is this format:
   
   ```  ignore("SPARK-31855: generate test files for checking compatibility 
with Spark 2.4/3.2") ```.
   
   Basically, you keep the SPARK ticket number and the reason in the test name.
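
   Roughly, applied to this case (a sketch only; the real suite extends 
`RemoteSparkSession`, and the final wording of the name is up to the author):

   ```scala
   import org.scalatest.funsuite.AnyFunSuite

   class NamingConventionExample extends AnyFunSuite {
     // Ticket and reason carried in the test name itself, no TODO comment needed.
     ignore("SPARK-42665: simple udf (ignored until UDF support is implemented)") {
       // test body unchanged
     }
   }
   ```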
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] otterc commented on a diff in pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


otterc commented on code in PR #40307:
URL: https://github.com/apache/spark/pull/40307#discussion_r1127049718


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -203,7 +205,8 @@ private[spark] class ExecutorAllocationManager(
   throw new SparkException(
 s"s${DYN_ALLOCATION_SUSTAINED_SCHEDULER_BACKLOG_TIMEOUT.key} must be > 
0!")
 }
-if (!conf.get(config.SHUFFLE_SERVICE_ENABLED)) {
+if (!conf.get(config.SHUFFLE_SERVICE_ENABLED) &&
+  !shuffleDriverComponents.supportsReliableStorage()) {

Review Comment:
   Should we not fix the usage of `config.SHUFFLE_SERVICE_ENABLED`? When it is 
`true`, the code everywhere assumes ESS is enabled. Instead, there could be a 
different config that indicates what kind of shuffle service is enabled and 
what it supports.
   
   Thinking about how this can evolve in the future: a remote shuffle service 
could provide additional functionality (e.g. encryption support), and then we 
would have to keep adding methods to `ShuffleDriverComponents`.
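
   Purely as an illustration of that shape (not an existing Spark API), the 
capability queries could be grouped instead of added one boolean method at a 
time:

   ```scala
   // Illustrative only; names are hypothetical.
   trait ShuffleServiceCapabilities {
     /** Shuffle blocks survive executor loss. */
     def supportsReliableStorage: Boolean
     /** Shuffle data is encrypted at rest by the service. */
     def supportsEncryption: Boolean
   }
   ```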



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] otterc commented on a diff in pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


otterc commented on code in PR #40307:
URL: https://github.com/apache/spark/pull/40307#discussion_r1127049718


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -203,7 +205,8 @@ private[spark] class ExecutorAllocationManager(
   throw new SparkException(
 s"s${DYN_ALLOCATION_SUSTAINED_SCHEDULER_BACKLOG_TIMEOUT.key} must be > 
0!")
 }
-if (!conf.get(config.SHUFFLE_SERVICE_ENABLED)) {
+if (!conf.get(config.SHUFFLE_SERVICE_ENABLED) &&
+  !shuffleDriverComponents.supportsReliableStorage()) {

Review Comment:
   Should we not fix the usage of `config.SHUFFLE_SERVICE_ENABLED`? When it is 
`true`, the code everywhere assumes ESS is enabled. Instead, there could be a 
different config that indicates what kind of shuffle service is enabled.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


dongjoon-hyun commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1457022823

   If you don't mind, please share some results later~ :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on a diff in pull request #40276: [SPARK-42630][CONNECT][PYTHON] Implement data type string parser

2023-03-06 Thread via GitHub


ueshin commented on code in PR #40276:
URL: https://github.com/apache/spark/pull/40276#discussion_r1127045146


##
python/pyspark/sql/connect/types.py:
##
@@ -342,20 +343,325 @@ def from_arrow_schema(arrow_schema: "pa.Schema") -> 
StructType:
 
 
 def parse_data_type(data_type: str) -> DataType:
-# Currently we don't have a way to have a current Spark session in Spark 
Connect, and
-# pyspark.sql.SparkSession has a centralized logic to control the session 
creation.
-# So uses pyspark.sql.SparkSession for now. Should replace this to using 
the current
-# Spark session for Spark Connect in the future.
-from pyspark.sql import SparkSession as PySparkSession
-
-assert is_remote()
-return_type_schema = (
-PySparkSession.builder.getOrCreate().createDataFrame(data=[], 
schema=data_type).schema
+"""
+Parses the given data type string to a :class:`DataType`. The data type 
string format equals
+:class:`DataType.simpleString`, except that the top level struct type can 
omit
+the ``struct<>``. Since Spark 2.3, this also supports a schema in a 
DDL-formatted
+string and case-insensitive strings.
+
+Examples
+
+>>> parse_data_type("int ")
+IntegerType()
+>>> parse_data_type("INT ")
+IntegerType()
+>>> parse_data_type("a: byte, b: decimal(  16 , 8   ) ")
+StructType([StructField('a', ByteType(), True), StructField('b', 
DecimalType(16,8), True)])
+>>> parse_data_type("a DOUBLE, b STRING")
+StructType([StructField('a', DoubleType(), True), StructField('b', 
StringType(), True)])
+>>> parse_data_type("a DOUBLE, b CHAR( 50 )")
+StructType([StructField('a', DoubleType(), True), StructField('b', 
CharType(50), True)])
+>>> parse_data_type("a DOUBLE, b VARCHAR( 50 )")
+StructType([StructField('a', DoubleType(), True), StructField('b', 
VarcharType(50), True)])
+>>> parse_data_type("a: array< short>")
+StructType([StructField('a', ArrayType(ShortType(), True), True)])
+>>> parse_data_type(" map ")
+MapType(StringType(), StringType(), True)
+
+>>> # Error cases
+>>> parse_data_type("blabla") # doctest: +IGNORE_EXCEPTION_DETAIL
+Traceback (most recent call last):
+...
+ParseException:...
+>>> parse_data_type("a: int,") # doctest: +IGNORE_EXCEPTION_DETAIL
+Traceback (most recent call last):
+...
+ParseException:...
+>>> parse_data_type("array>> parse_data_type("map>") # doctest: 
+IGNORE_EXCEPTION_DETAIL
+Traceback (most recent call last):
+...
+ParseException:...
+"""
+try:
+# DDL format, "fieldname datatype, fieldname datatype".
+return DDLSchemaParser(data_type).from_ddl_schema()
+except ParseException as e:
+try:
+# For backwards compatibility, "integer", "struct" and etc.
+return DDLDataTypeParser(data_type).from_ddl_datatype()
+except ParseException:
+try:
+# For backwards compatibility, "fieldname: datatype, 
fieldname: datatype" case.
+return 
DDLDataTypeParser(f"struct<{data_type}>").from_ddl_datatype()
+except ParseException:
+raise e from None
+
+
+class DataTypeParserBase:
+REGEXP_IDENTIFIER: Final[Pattern] = re.compile("\\w+|`(?:``|[^`])*`", 
re.MULTILINE)
+REGEXP_INTEGER_VALUES: Final[Pattern] = re.compile(
+"\\(\\s*(?:[+-]?\\d+)\\s*(?:,\\s*(?:[+-]?\\d+)\\s*)*\\)", re.MULTILINE
 )
-with_col_name = " " in data_type.strip()
-if len(return_type_schema.fields) == 1 and not with_col_name:
-# To match pyspark.sql.types._parse_datatype_string
-return_type = return_type_schema.fields[0].dataType
-else:
-return_type = return_type_schema
-return return_type
+REGEXP_INTERVAL_TYPE: Final[Pattern] = re.compile(
+"(day|hour|minute|second)(?:\\s+to\\s+(hour|minute|second))?", 
re.IGNORECASE | re.MULTILINE
+)
+REGEXP_NOT_NULL_COMMENT: Final[Pattern] = re.compile(
+"(not\\s+null)?(?:(?(1)\\s+)comment\\s+'((?:'|[^'])*)')?", 
re.IGNORECASE | re.MULTILINE
+)
+
+def __init__(self, type_str: str):
+self._type_str = type_str
+self._pos = 0
+self._lstrip()
+
+def _lstrip(self) -> None:
+remaining = self._type_str[self._pos :]
+self._pos = self._pos + (len(remaining) - len(remaining.lstrip()))
+
+def _parse_data_type(self) -> DataType:
+type_str = self._type_str[self._pos :]
+m = self.REGEXP_IDENTIFIER.match(type_str)
+if m:
+data_type_name = m.group(0).lower().strip("`").replace("``", "`")
+self._pos = self._pos + len(m.group(0))
+self._lstrip()
+if data_type_name == "array":
+return self._parse_array_type()
+elif data_type_name == "map":
+return self._parse_map_type()
+elif data_type_name == "struct":
+ 

[GitHub] [spark] mridulm commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


mridulm commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-145754

   We are evaluating it currently @dongjoon-hyun :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] FurcyPin commented on a diff in pull request #40271: [WIP][SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast

2023-03-06 Thread via GitHub


FurcyPin commented on code in PR #40271:
URL: https://github.com/apache/spark/pull/40271#discussion_r1126999170


##
python/pyspark/sql/tests/test_functions.py:
##
@@ -1268,6 +1268,12 @@ def test_bucket(self):
 message_parameters={"arg_name": "numBuckets", "arg_type": "str"},
 )
 
+def test_no_cast(self):

Review Comment:
   I apologize, but I am not sure I understand what you mean... it sounds like 
you think that `functions.cast` already exists. It does not; there is only 
`Column.cast`. That is what this test does: it checks that `functions.cast` 
does not exist (see the sketch below).
   
   We _could_ add a `functions.cast` method, but I don't think we should 
because 
   1. the need is already covered by `Column.cast`
   2. that would require to add it to the Scala/Java API to be consistent



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] FurcyPin commented on a diff in pull request #40271: [WIP][SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast

2023-03-06 Thread via GitHub


FurcyPin commented on code in PR #40271:
URL: https://github.com/apache/spark/pull/40271#discussion_r1126999170


##
python/pyspark/sql/tests/test_functions.py:
##
@@ -1268,6 +1268,12 @@ def test_bucket(self):
 message_parameters={"arg_name": "numBuckets", "arg_type": "str"},
 )
 
+def test_no_cast(self):

Review Comment:
   I apologize, but I am not sure I understand what you mean... it sounds like 
you think that `functions.cast` already exists.
   It does not, there is only `Column.cast`.
   
   We _could_ add a `functions.cast` method, but I don't think we should 
because 
   1. the need is already covered by `Column.cast`
   2. that would require to add it to the Scala/Java API to be consistent



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #40303: [SPARK-42656][CONNECT][Followup] Improve the script to start spark-connect server

2023-03-06 Thread via GitHub


amaliujia commented on PR #40303:
URL: https://github.com/apache/spark/pull/40303#issuecomment-1456893990

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #40307: [DRAFT][SPARK-42689][CORE][SHUFFLE]: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


mridulm commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1456844136

   This is still WIP, but want to get early feedback.
   +CC @Ngone51, @otterc, @waitinfuture


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40307: [DRAFT][CORE][SHUFFLE]: SPARK-42689: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


mridulm commented on code in PR #40307:
URL: https://github.com/apache/spark/pull/40307#discussion_r1126939110


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -596,6 +591,13 @@ class SparkContext(config: SparkConf) extends Logging {
   _conf.set(APP_ATTEMPT_ID, attemptId)
   _env.blockManager.blockStoreClient.setAppAttemptId(attemptId)
 }
+
+// initialize after application id and attempt id has been initialized

Review Comment:
   Shuffle driver components typically depend on the application/attempt id for 
managing their metadata.
   Initializing it after these values are set in `SparkContext` allows the 
`ShuffleDriverComponents` to rely on this, instead of resorting to hacks like a 
`SparkListener` waiting for application start, etc., for the actual 
initialization.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm opened a new pull request, #40307: Draft: SPARK-42689: Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-06 Thread via GitHub


mridulm opened a new pull request, #40307:
URL: https://github.com/apache/spark/pull/40307

   ### What changes were proposed in this pull request?
   
   Currently, if there is an executor node loss, we assume the shuffle data on 
that node is also lost. This is not necessarily the case if there is a shuffle 
component managing the shuffle data and reliably maintaining it (for example, 
in distributed filesystem or in a disaggregated shuffle cluster).
   
   ### Why are the changes needed?
   
   Downstream projects carry patches to Apache Spark in order to work around this 
issue; for example, Apache Celeborn has 
[this](https://github.com/apache/incubator-celeborn/blob/main/assets/spark-patch/RSS_RDA_spark3.patch).
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Enhances the `ShuffleDriverComponents` API, but defaults to current behavior.
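
   A hedged sketch of how a deployment might use this once it lands (the plugin 
class name is hypothetical; `spark.shuffle.sort.io.plugin.class` is the existing 
knob for swapping in a `ShuffleDataIO` implementation):

   ```scala
   import org.apache.spark.SparkConf

   val conf = new SparkConf()
     // hypothetical plugin whose driver components report reliable storage
     .set("spark.shuffle.sort.io.plugin.class", "com.example.RemoteStoreShuffleDataIO")
     .set("spark.dynamicAllocation.enabled", "true")
     // with reliable shuffle storage, the external shuffle service is no longer required
     .set("spark.shuffle.service.enabled", "false")
   ```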
   
   
   ### How was this patch tested?
   
   Existing unit tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40217: [SPARK-42559][CONNECT] Implement DataFrameNaFunctions

2023-03-06 Thread via GitHub


hvanhovell closed pull request #40217: [SPARK-42559][CONNECT] Implement 
DataFrameNaFunctions
URL: https://github.com/apache/spark/pull/40217


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40217: [SPARK-42559][CONNECT] Implement DataFrameNaFunctions

2023-03-06 Thread via GitHub


hvanhovell commented on PR #40217:
URL: https://github.com/apache/spark/pull/40217#issuecomment-1456804495

   Merging


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


