[GitHub] [spark] yaooqinn commented on pull request #42620: [SPARK-44921][SQL] Remove SqlBaseLexer.tokens from codebase

2023-08-22 Thread via GitHub


yaooqinn commented on PR #42620:
URL: https://github.com/apache/spark/pull/42620#issuecomment-1689327389

   thanks, merged to master and 3.5.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn closed pull request #42620: [SPARK-44921][SQL] Remove SqlBaseLexer.tokens from codebase

2023-08-22 Thread via GitHub


yaooqinn closed pull request #42620: [SPARK-44921][SQL] Remove 
SqlBaseLexer.tokens from codebase
URL: https://github.com/apache/spark/pull/42620





[GitHub] [spark] yaooqinn commented on pull request #42619: [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()

2023-08-22 Thread via GitHub


yaooqinn commented on PR #42619:
URL: https://github.com/apache/spark/pull/42619#issuecomment-1689325031

   Thanks @JoshRosen @cloud-fan @HyukjinKwon and @dongjoon-hyun 
   
   Merged to master, '3.5.0', '3.4.2', and '3.3.4'





[GitHub] [spark] yaooqinn closed pull request #42619: [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()

2023-08-22 Thread via GitHub


yaooqinn closed pull request #42619: [SPARK-44920][CORE] Use await() instead of 
awaitUninterruptibly() in TransportClientFactory.createClient()
URL: https://github.com/apache/spark/pull/42619





[GitHub] [spark] zhengruifeng commented on a diff in pull request #42608: [SPARK-42017][PYTHON][CONNECT][TESTS] Enable `ColumnParityTests.test_access_column`

2023-08-22 Thread via GitHub


zhengruifeng commented on code in PR #42608:
URL: https://github.com/apache/spark/pull/42608#discussion_r1302514076


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1623,6 +1624,10 @@ def __getitem__(self, item: Union[int, str, Column, 
List, Tuple]) -> Union[Colum
 alias = self._get_alias()
 if self._plan is None:
 raise SparkConnectException("Cannot analyze on empty plan.")
+
+if "." not in item and "*" not in item and "`" not in item and 
item not in self.columns:

Review Comment:
   good idea!






[GitHub] [spark] dongjoon-hyun commented on pull request #42624: [SPARK-44925][K8S] K8s default service token file should not be materialized into token

2023-08-22 Thread via GitHub


dongjoon-hyun commented on PR #42624:
URL: https://github.com/apache/spark/pull/42624#issuecomment-1689321845

   Merged to master/3.5/3.4/3.3.





[GitHub] [spark] LuciferYang commented on a diff in pull request #42591: [SPARK-44784][CONNECT] Make SBT testing hermetic.

2023-08-22 Thread via GitHub


LuciferYang commented on code in PR #42591:
URL: https://github.com/apache/spark/pull/42591#discussion_r1302509829


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/test/RemoteSparkSession.scala:
##
@@ -0,0 +1,241 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.test
+
+import java.io.{File, IOException, OutputStream}
+import java.lang.ProcessBuilder
+import java.lang.ProcessBuilder.Redirect
+import java.nio.file.Paths
+import java.util.concurrent.TimeUnit
+
+import scala.concurrent.duration.FiniteDuration
+
+import org.apache.commons.lang3.{JavaVersion, SystemUtils}
+import org.scalactic.source.Position
+import org.scalatest.{BeforeAndAfterAll, Tag}
+
+import org.apache.spark.SparkBuildInfo
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.connect.client.GrpcRetryHandler.RetryPolicy
+import org.apache.spark.sql.connect.client.SparkConnectClient
+import org.apache.spark.sql.connect.common.config.ConnectCommon
+import org.apache.spark.sql.test.IntegrationTestUtils._
+
+/**
+ * An util class to start a local spark connect server in a different process for local E2E tests.
+ * Pre-running the tests, the spark connect artifact needs to be built using e.g. `build/sbt
+ * package`. It is designed to start the server once but shared by all tests. It is equivalent to
+ * use the following command to start the connect server via command line:
+ *
+ * {{{
+ * bin/spark-shell \
+ * --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar | paste -sd ',' -` \
+ * --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
+ * }}}
+ *
+ * Set system property `spark.test.home` or env variable `SPARK_HOME` if the test is not executed
+ * from the Spark project top folder. Set system property `spark.debug.sc.jvm.client=true` or
+ * environment variable `SPARK_DEBUG_SC_JVM_CLIENT=true` to print the server process output in the
+ * console to debug server start stop problems.
+ */
+object SparkConnectServerUtils {
+
+  // Server port
+  val port: Int =
+ConnectCommon.CONNECT_GRPC_BINDING_PORT + util.Random.nextInt(1000)
+
+  @volatile private var stopped = false
+
+  private var consoleOut: OutputStream = _
+  private val serverStopCommand = "q"
+
+  private lazy val sparkConnect: java.lang.Process = {
+debug("Starting the Spark Connect Server...")
+val connectJar = findJar(
+  "connector/connect/server",
+  "spark-connect-assembly",
+  "spark-connect").getCanonicalPath
+
+val command = Seq.newBuilder[String]
+command += "bin/spark-submit"
+command += "--driver-class-path" += connectJar
+command += "--class" += 
"org.apache.spark.sql.connect.SimpleSparkConnectService"
+command += "--conf" += s"spark.connect.grpc.binding.port=$port"
+command ++= testConfigs
+command ++= debugConfigs
+command += connectJar
+val builder = new ProcessBuilder(command.result(): _*)
+builder.directory(new File(sparkHome))
+val environment = builder.environment()
+environment.remove("SPARK_DIST_CLASSPATH")
+if (isDebug) {
+  builder.redirectError(Redirect.INHERIT)
+  builder.redirectOutput(Redirect.INHERIT)
+}
+
+val process = builder.start()
+consoleOut = process.getOutputStream
+
+// Adding JVM shutdown hook
+sys.addShutdownHook(stop())
+process
+  }
+
+  /**
+   * As one shared spark will be started for all E2E tests, for tests that needs some special
+   * configs, we add them here
+   */
+  private def testConfigs: Seq[String] = {
+// To find InMemoryTableCatalog for V2 writer tests
+val catalystTestJar =
+  tryFindJar("sql/catalyst", "spark-catalyst", "spark-catalyst", test = 
true)

Review Comment:
   Yes, it will build.




[GitHub] [spark] yaooqinn commented on pull request #42614: [SPARK-44922][TESTS] Disable o.a.p.h.InternalParquetRecordWriter logs for tests

2023-08-22 Thread via GitHub


yaooqinn commented on PR #42614:
URL: https://github.com/apache/spark/pull/42614#issuecomment-1689317257

   The test failures in RocksDBStateStore occur frequently and are unrelated to this change.
   
   thanks @dongjoon-hyun @HyukjinKwon 
   
   Merged to master/3.5/3.4.  





[GitHub] [spark] dongjoon-hyun closed pull request #42624: [SPARK-44925][K8S] K8s default service token file should not be materialized into token

2023-08-22 Thread via GitHub


dongjoon-hyun closed pull request #42624: [SPARK-44925][K8S] K8s default 
service token file should not be materialized into token
URL: https://github.com/apache/spark/pull/42624





[GitHub] [spark] zwangsheng commented on a diff in pull request #42600: [SPARK-44906][K8S] Make `Kubernetes[Driver|Executor]Conf.annotations` substitute annotations instead of feature steps

2023-08-22 Thread via GitHub


zwangsheng commented on code in PR #42600:
URL: https://github.com/apache/spark/pull/42600#discussion_r1302507782


##
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala:
##
@@ -117,6 +117,7 @@ private[spark] class KubernetesDriverConf(
 
   override def annotations: Map[String, String] = {
 KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, KUBERNETES_DRIVER_ANNOTATION_PREFIX)
+  .map(p => (p._1, Utils.substituteAppNExecIds(p._2, appId, "")))

Review Comment:
   Sorry, I misunderstood `the consistent short style`.
   
   Does the following code conform?
   ```
   .map{ case(k, v) => (k, Utils.substituteAppNExecIds(v, appId, executorId)) }
   ```






[GitHub] [spark] yaooqinn closed pull request #42614: [SPARK-44922][TESTS] Disable o.a.p.h.InternalParquetRecordWriter logs for tests

2023-08-22 Thread via GitHub


yaooqinn closed pull request #42614: [SPARK-44922][TESTS] Disable 
o.a.p.h.InternalParquetRecordWriter logs for tests
URL: https://github.com/apache/spark/pull/42614





[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`

2023-08-22 Thread via GitHub


LuciferYang commented on code in PR #42596:
URL: https://github.com/apache/spark/pull/42596#discussion_r1302455703


##
python/pyspark/sql/functions.py:
##
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: 
Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> 
Column:
-"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for 
approximate distinct count
-of column `col`.
+"""
+Applies an aggregate function to return an approximate distinct count of 
the specified column.
 
-.. versionadded:: 2.1.0
+This function returns a new :class:`~pyspark.sql.Column` that estimates 
the number of distinct
+elements in a column or a group of columns.
 
-.. versionchanged:: 3.4.0
-Supports Spark Connect.
+.. versionadded:: 2.1.0
 
 .. versionchanged:: 3.4.0
 Supports Spark Connect.
 
 Parameters
 --
 col : :class:`~pyspark.sql.Column` or str
+The label of the column to count distinct values in.
 rsd : float, optional
-maximum relative standard deviation allowed (default = 0.05).
-For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+The maximum allowed relative standard deviation (default = 0.05).
+If rsd < 0.01, it would be more efficient to use 
:func:`count_distinct`.
 
 Returns
 ---
 :class:`~pyspark.sql.Column`
-the column of computed results.
+A new Column object representing the approximate unique count.
+
+See Also
+--
+:meth:`pyspark.sql.functions.count_distinct`
 
 Examples
 
+Example 1: Counting distinct values in a single column DataFrame 
representing integers
+
+>>> from pyspark.sql.functions import approx_count_distinct
 >>> df = spark.createDataFrame([1,2,2,3], "INT")
 >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
 +---+
 |distinct_values|
 +---+
 |  3|
 +---+
+
+Example 2: Counting distinct values in a single column DataFrame 
representing strings
+
+>>> from pyspark.sql.functions import approx_count_distinct
+>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], 
"string").toDF("fruit")
+>>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
++---+
+|distinct_fruits|
++---+
+|  3|
++---+
+
+Example 3: Counting distinct values in a DataFrame with multiple columns
+
+>>> from pyspark.sql.functions import approx_count_distinct, struct
+>>> df = spark.createDataFrame([("Alice", 1),
+... ("Alice", 2),
+... ("Bob", 3),
+... ("Bob", 3)], ["name", "value"])
+>>> df = df.withColumn("combined", struct("name", "value"))
+>>> 
df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show()
++--+
+|distinct_pairs|
++--+
+| 3|
++--+
+
+Example 4: Counting distinct values with a specified relative standard 
deviation
+
+>>> from pyspark.sql.functions import approx_count_distinct
+>>> df = spark.range(10)
+>>> df.agg(approx_count_distinct("id", 
0.1).alias('distinct_values')).show()

Review Comment:
   Change to
   
   ```
   >>> df.agg(approx_count_distinct("id").alias('with_default_rsd'),
   ...        approx_count_distinct("id", 0.1).alias('with_rsd_0.1')).show()
   +----------------+------------+
   |with_default_rsd|with_rsd_0.1|
   +----------------+------------+
   |           95546|      102065|
   +----------------+------------+
   ```
   though I am not sure if this is the `compare` you mentioned.
   






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42548: [WIP][SPARK-44750][PYTHON][CONNECT] Apply configuration to sparksession during creation

2023-08-22 Thread via GitHub


HyukjinKwon commented on code in PR #42548:
URL: https://github.com/apache/spark/pull/42548#discussion_r1302478884


##
python/pyspark/sql/connect/session.py:
##
@@ -176,6 +180,27 @@ def enableHiveSupport(self) -> "SparkSession.Builder":
 error_class="NOT_IMPLEMENTED", message_parameters={"feature": 
"enableHiveSupport"}
 )
 
+def _apply_options(self, session: "SparkSession") -> None:

Review Comment:
   BTW, we will have to fix 
`pyspark.sql.session.SparkSession.Builder.getOrCreate` too to pass the 
configurations properly.
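
   A minimal sketch (not the PR's code) of the expectation from the user's point of view; the config key below is only an example:

   ```python
   from pyspark.sql import SparkSession

   # Options set on the builder are expected to land in the session's conf,
   # whether getOrCreate() creates a new session or returns an existing one.
   spark = (
       SparkSession.builder
       .config("spark.sql.shuffle.partitions", "4")  # hypothetical option
       .getOrCreate()
   )
   print(spark.conf.get("spark.sql.shuffle.partitions"))
   ```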






[GitHub] [spark] dongjoon-hyun closed pull request #42616: [SPARK-44840][SQL][3.5] Make `array_insert()` 1-based for negative indexes

2023-08-22 Thread via GitHub


dongjoon-hyun closed pull request #42616: [SPARK-44840][SQL][3.5] Make 
`array_insert()` 1-based for negative indexes
URL: https://github.com/apache/spark/pull/42616





[GitHub] [spark] dongjoon-hyun commented on pull request #42616: [SPARK-44840][SQL][3.5] Make `array_insert()` 1-based for negative indexes

2023-08-22 Thread via GitHub


dongjoon-hyun commented on PR #42616:
URL: https://github.com/apache/spark/pull/42616#issuecomment-1689272709

   Merged to branch-3.5. Thank you, @MaxGekk and all.





[GitHub] [spark] dongjoon-hyun commented on pull request #42619: [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()

2023-08-22 Thread via GitHub


dongjoon-hyun commented on PR #42619:
URL: https://github.com/apache/spark/pull/42619#issuecomment-1689271967

   Could you re-trigger the failed test cases, although they look unrelated, @JoshRosen?





[GitHub] [spark] dongjoon-hyun commented on pull request #42600: [SPARK-44906][K8S] Make `Kubernetes[Driver|Executor]Conf.annotations` substitute annotations instead of feature steps

2023-08-22 Thread via GitHub


dongjoon-hyun commented on PR #42600:
URL: https://github.com/apache/spark/pull/42600#issuecomment-1689271048

   Thank you for the update, @zwangsheng. It seems there was a minor miscommunication above; I added two comments.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42600: [SPARK-44906][K8S] Make `Kubernetes[Driver|Executor]Conf.annotations` substitute annotations instead of feature steps

2023-08-22 Thread via GitHub


dongjoon-hyun commented on code in PR #42600:
URL: https://github.com/apache/spark/pull/42600#discussion_r1302473735


##
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala:
##
@@ -117,6 +117,7 @@ private[spark] class KubernetesDriverConf(
 
   override def annotations: Map[String, String] = {
 KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, KUBERNETES_DRIVER_ANNOTATION_PREFIX)
+  .map(p => (p._1, Utils.substituteAppNExecIds(p._2, appId, "")))

Review Comment:
   Sorry, but what I suggested 
(https://github.com/apache/spark/pull/42600#discussion_r1302137555) is the 
following style instead of `p => ...`.
   ```scala
   case (k, v) => ...
   ```



##
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala:
##
@@ -188,6 +189,7 @@ private[spark] class KubernetesExecutorConf(
 
   override def annotations: Map[String, String] = {
 KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, KUBERNETES_EXECUTOR_ANNOTATION_PREFIX)
+  .map(p => (p._1, Utils.substituteAppNExecIds(p._2, appId, executorId)))

Review Comment:
   ditto.






[GitHub] [spark] cloud-fan commented on a diff in pull request #42531: [SPARK-44846][SQL] Pull out complex grouping expressions after remove redundant aggregates

2023-08-22 Thread via GitHub


cloud-fan commented on code in PR #42531:
URL: https://github.com/apache/spark/pull/42531#discussion_r1302472237


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregates.scala:
##
@@ -42,11 +42,12 @@ object RemoveRedundantAggregates extends Rule[LogicalPlan] 
with AliasHelper {
   )
 
   // We might have introduces non-deterministic grouping expression

Review Comment:
   I have a different idea. I think it's risky to put the new grouping expressions back into the `Aggregate`, as the grouping expressions contain things from the SELECT list. This is a long-standing issue, and I feel it's better to just create a `Project` below the `Aggregate` to calculate the grouping expressions; other optimizer rules can merge/eliminate this extra `Project` if it only contains `Attributes`.
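
   For illustration only (assuming an active SparkSession `spark`; the table and column names are made up), this is the query shape being discussed. The proposal is that the optimizer compute the complex grouping expression once in a `Project` under the `Aggregate`, so the `Aggregate` itself only groups on an attribute:

   ```python
   # Illustration of the plan shape under discussion, not the optimizer rule itself.
   df = spark.createDataFrame([(1, 2), (1, 2), (3, 4)], ["a", "b"])
   df.createOrReplaceTempView("t")
   # With the proposed rewrite the optimized plan would look roughly like:
   #   Aggregate [ab], [ab, count(1) AS cnt]
   #   +- Project [(a + b) AS ab]
   #      +- Relation t
   spark.sql("SELECT a + b AS ab, COUNT(*) AS cnt FROM t GROUP BY a + b").explain(True)
   ```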






[GitHub] [spark] yaooqinn opened a new pull request, #42625: [SPARK-44802][INFRA][FOLLOWUP] Eagerly check if the token is valid to align with the behavior of username/password authn

2023-08-22 Thread via GitHub


yaooqinn opened a new pull request, #42625:
URL: https://github.com/apache/spark/pull/42625

   
   
   
   ### What changes were proposed in this pull request?
   
   
   SPARK-44802 now allows token authentication when resolving Jira issues during pull request merging. However, the token authentication is lazy during the initial handshake, so maintainers might get confused someday.
   
   This pull request promptly calls the current_user() function to initiate authentication and provides clear instructions when the token has expired (see the sketch after this description).
   
   ### Why are the changes needed?
   
   
   Make it easy for maintainers to update their expired Jira tokens.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No, this only affects maintainers.
   
   ### How was this patch tested?
   
   
   locally verified the code snippet
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   
   no
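
   A hedged sketch of the eager check described above; this is not the actual dev/merge_spark_pr.py code and assumes the third-party `jira` Python package:

   ```python
   from jira import JIRA  # third-party "jira" package, assumed installed

   def connect_jira(server: str, token: str) -> JIRA:
       """Authenticate eagerly so an expired token fails fast with a clear hint."""
       client = JIRA(server=server, token_auth=token)
       try:
           # current_user() forces an authenticated round trip right away,
           # instead of deferring the failure to the first real API call.
           user = client.current_user()
       except Exception as exc:
           raise SystemExit(
               "JIRA token authentication failed; the token may have expired. "
               f"Please regenerate it and retry. ({exc})"
           )
       print(f"Authenticated to JIRA as {user}")
       return client
   ```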





[GitHub] [spark] HeartSaVioR closed pull request #42567: [SPARK-44878][SS] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full

2023-08-22 Thread via GitHub


HeartSaVioR closed pull request #42567: [SPARK-44878][SS] Disable strict limit 
for RocksDB write manager to avoid insertion exception on cache full
URL: https://github.com/apache/spark/pull/42567





[GitHub] [spark] HeartSaVioR commented on pull request #42567: [SPARK-44878][SS] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full

2023-08-22 Thread via GitHub


HeartSaVioR commented on PR #42567:
URL: https://github.com/apache/spark/pull/42567#issuecomment-1689252968

   Thanks! Merging to master.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42548: [WIP][SPARK-44750][PYTHON][CONNECT] Apply configuration to sparksession during creation

2023-08-22 Thread via GitHub


HyukjinKwon commented on code in PR #42548:
URL: https://github.com/apache/spark/pull/42548#discussion_r1302457534


##
python/pyspark/sql/connect/session.py:
##
@@ -176,6 +180,27 @@ def enableHiveSupport(self) -> "SparkSession.Builder":
 error_class="NOT_IMPLEMENTED", message_parameters={"feature": 
"enableHiveSupport"}
 )
 
+def _apply_options(self, session: "SparkSession") -> None:
+with self._lock:
+for k, v in self._options.items():
+try:
+session.conf.set(k, v)
+except AnalysisException as e:
+current = currentframe()

Review Comment:
   What are you trying to do here BTW?






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42548: [WIP][SPARK-44750][PYTHON][CONNECT] Apply configuration to sparksession during creation

2023-08-22 Thread via GitHub


HyukjinKwon commented on code in PR #42548:
URL: https://github.com/apache/spark/pull/42548#discussion_r1302457329


##
python/pyspark/sql/connect/session.py:
##
@@ -176,6 +180,27 @@ def enableHiveSupport(self) -> "SparkSession.Builder":
 error_class="NOT_IMPLEMENTED", message_parameters={"feature": 
"enableHiveSupport"}
 )
 
+def _apply_options(self, session: "SparkSession") -> None:
+with self._lock:
+for k, v in self._options.items():
+try:
+session.conf.set(k, v)
+except AnalysisException as e:
+current = currentframe()
+lineno = getframeinfo(current).lineno + 1 if current is not None else 0
+print(

Review Comment:
   This outer `print` won't be needed. You could directly use `warnings.warn`
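
   A minimal sketch of the suggestion; the helper name and the broad exception handling are illustrative, not the PR's final code:

   ```python
   import warnings

   def _apply_options(session, options):
       # Report a config that could not be applied via warnings.warn
       # instead of wrapping the message in print().
       for key, value in options.items():
           try:
               session.conf.set(key, value)
           except Exception as exc:  # AnalysisException in the real code path
               warnings.warn(f"Failed to set {key}={value}: {exc}")
   ```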






[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`

2023-08-22 Thread via GitHub


LuciferYang commented on code in PR #42596:
URL: https://github.com/apache/spark/pull/42596#discussion_r1302445423


##
python/pyspark/sql/functions.py:
##
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: 
Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> 
Column:
-"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for 
approximate distinct count
-of column `col`.
+"""
+Applies an aggregate function to return an approximate distinct count of 
the specified column.
 
-.. versionadded:: 2.1.0
+This function returns a new :class:`~pyspark.sql.Column` that estimates 
the number of distinct
+elements in a column or a group of columns.
 
-.. versionchanged:: 3.4.0
-Supports Spark Connect.
+.. versionadded:: 2.1.0
 
 .. versionchanged:: 3.4.0
 Supports Spark Connect.
 
 Parameters
 --
 col : :class:`~pyspark.sql.Column` or str
+The label of the column to count distinct values in.
 rsd : float, optional
-maximum relative standard deviation allowed (default = 0.05).
-For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+The maximum allowed relative standard deviation (default = 0.05).
+If rsd < 0.01, it would be more efficient to use 
:func:`count_distinct`.
 
 Returns
 ---
 :class:`~pyspark.sql.Column`
-the column of computed results.
+A new Column object representing the approximate unique count.
+
+See Also
+--
+:meth:`pyspark.sql.functions.count_distinct`
 
 Examples
 
+Example 1: Counting distinct values in a single column DataFrame 
representing integers
+
+>>> from pyspark.sql.functions import approx_count_distinct
 >>> df = spark.createDataFrame([1,2,2,3], "INT")
 >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
 +---+
 |distinct_values|
 +---+
 |  3|
 +---+
+
+Example 2: Counting distinct values in a single column DataFrame 
representing strings
+
+>>> from pyspark.sql.functions import approx_count_distinct
+>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], 
"string").toDF("fruit")

Review Comment:
   Change to `df = spark.createDataFrame([("apple",), ("orange",), ("apple",), ("banana",)], ['fruit'])`. Is that right?






[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`

2023-08-22 Thread via GitHub


LuciferYang commented on code in PR #42596:
URL: https://github.com/apache/spark/pull/42596#discussion_r1302445423


##
python/pyspark/sql/functions.py:
##
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: 
Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> 
Column:
-"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for 
approximate distinct count
-of column `col`.
+"""
+Applies an aggregate function to return an approximate distinct count of 
the specified column.
 
-.. versionadded:: 2.1.0
+This function returns a new :class:`~pyspark.sql.Column` that estimates 
the number of distinct
+elements in a column or a group of columns.
 
-.. versionchanged:: 3.4.0
-Supports Spark Connect.
+.. versionadded:: 2.1.0
 
 .. versionchanged:: 3.4.0
 Supports Spark Connect.
 
 Parameters
 --
 col : :class:`~pyspark.sql.Column` or str
+The label of the column to count distinct values in.
 rsd : float, optional
-maximum relative standard deviation allowed (default = 0.05).
-For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+The maximum allowed relative standard deviation (default = 0.05).
+If rsd < 0.01, it would be more efficient to use 
:func:`count_distinct`.
 
 Returns
 ---
 :class:`~pyspark.sql.Column`
-the column of computed results.
+A new Column object representing the approximate unique count.
+
+See Also
+--
+:meth:`pyspark.sql.functions.count_distinct`
 
 Examples
 
+Example 1: Counting distinct values in a single column DataFrame 
representing integers
+
+>>> from pyspark.sql.functions import approx_count_distinct
 >>> df = spark.createDataFrame([1,2,2,3], "INT")
 >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
 +---+
 |distinct_values|
 +---+
 |  3|
 +---+
+
+Example 2: Counting distinct values in a single column DataFrame 
representing strings
+
+>>> from pyspark.sql.functions import approx_count_distinct
+>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], 
"string").toDF("fruit")

Review Comment:
   change to `spark.createDataFrame([("apple"), ("orange"), ("apple"), 
("banana")], ['fruit'])`, Is that so?






[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`

2023-08-22 Thread via GitHub


LuciferYang commented on code in PR #42596:
URL: https://github.com/apache/spark/pull/42596#discussion_r1302455703


##
python/pyspark/sql/functions.py:
##
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: 
Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> 
Column:
-"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for 
approximate distinct count
-of column `col`.
+"""
+Applies an aggregate function to return an approximate distinct count of 
the specified column.
 
-.. versionadded:: 2.1.0
+This function returns a new :class:`~pyspark.sql.Column` that estimates 
the number of distinct
+elements in a column or a group of columns.
 
-.. versionchanged:: 3.4.0
-Supports Spark Connect.
+.. versionadded:: 2.1.0
 
 .. versionchanged:: 3.4.0
 Supports Spark Connect.
 
 Parameters
 --
 col : :class:`~pyspark.sql.Column` or str
+The label of the column to count distinct values in.
 rsd : float, optional
-maximum relative standard deviation allowed (default = 0.05).
-For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+The maximum allowed relative standard deviation (default = 0.05).
+If rsd < 0.01, it would be more efficient to use 
:func:`count_distinct`.
 
 Returns
 ---
 :class:`~pyspark.sql.Column`
-the column of computed results.
+A new Column object representing the approximate unique count.
+
+See Also
+--
+:meth:`pyspark.sql.functions.count_distinct`
 
 Examples
 
+Example 1: Counting distinct values in a single column DataFrame 
representing integers
+
+>>> from pyspark.sql.functions import approx_count_distinct
 >>> df = spark.createDataFrame([1,2,2,3], "INT")
 >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
 +---+
 |distinct_values|
 +---+
 |  3|
 +---+
+
+Example 2: Counting distinct values in a single column DataFrame 
representing strings
+
+>>> from pyspark.sql.functions import approx_count_distinct
+>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], 
"string").toDF("fruit")
+>>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
++---+
+|distinct_fruits|
++---+
+|  3|
++---+
+
+Example 3: Counting distinct values in a DataFrame with multiple columns
+
+>>> from pyspark.sql.functions import approx_count_distinct, struct
+>>> df = spark.createDataFrame([("Alice", 1),
+... ("Alice", 2),
+... ("Bob", 3),
+... ("Bob", 3)], ["name", "value"])
+>>> df = df.withColumn("combined", struct("name", "value"))
+>>> 
df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show()
++--+
+|distinct_pairs|
++--+
+| 3|
++--+
+
+Example 4: Counting distinct values with a specified relative standard 
deviation
+
+>>> from pyspark.sql.functions import approx_count_distinct
+>>> df = spark.range(10)
+>>> df.agg(approx_count_distinct("id", 
0.1).alias('distinct_values')).show()

Review Comment:
   Change to
   
   ```
   >>> df.agg(approx_count_distinct("id").alias('with_default_rsd'),
   ...        approx_count_distinct("id", 0.1).alias('with_rsd_0.1')).show()
   +----------------+------------+
   |with_default_rsd|with_rsd_0.1|
   +----------------+------------+
   |           95546|      102065|
   +----------------+------------+
   ```
   though I am not sure if this is the `compare` you mentioned.
   



##
python/pyspark/sql/functions.py:
##
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: 
Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> 
Column:
-"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for 
approximate distinct count
-of column `col`.
+"""
+Applies an aggregate function to return an approximate distinct count of 
the specified column.
 
-.. versionadded:: 2.1.0
+This function returns a new :class:`~pyspark.sql.Column` that estimates 
the number of distinct
+elements in a column or a group of columns.

Review Comment:
   done




[GitHub] [spark] beliefer commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method

2023-08-22 Thread via GitHub


beliefer commented on code in PR #42612:
URL: https://github.com/apache/spark/pull/42612#discussion_r1302453549


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala:
##
@@ -279,7 +282,8 @@ case class StaticInvoke(
 inputTypes: Seq[AbstractDataType] = Nil,
 propagateNull: Boolean = true,
 returnNullable: Boolean = true,
-isDeterministic: Boolean = true) extends InvokeLike {
+isDeterministic: Boolean = true,
+scalarFunction: Option[ScalarFunction[_]] = None) extends InvokeLike {

Review Comment:
   Please refer 
https://github.com/apache/spark/pull/42612#discussion_r1302453275






[GitHub] [spark] beliefer commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method

2023-08-22 Thread via GitHub


beliefer commented on code in PR #42612:
URL: https://github.com/apache/spark/pull/42612#discussion_r1302453275


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala:
##
@@ -159,7 +159,7 @@ object V2ExpressionUtils extends SQLConfHelper with Logging {
 StaticInvoke(scalarFunc.getClass, scalarFunc.resultType(),
   MAGIC_METHOD_NAME, arguments, inputTypes = declaredInputTypes,
   propagateNull = false, returnNullable = scalarFunc.isResultNullable,
-  isDeterministic = scalarFunc.isDeterministic)
+  isDeterministic = scalarFunc.isDeterministic, scalarFunction = Some(scalarFunc))

Review Comment:
   Because Spark already constructs the StaticInvoke by calling these methods, e.g. `scalarFunc.getClass`, `scalarFunc.resultType()`, and so on.
   Please keep it the same way.






[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42608: [SPARK-42017][PYTHON][CONNECT][TESTS] Enable `ColumnParityTests.test_access_column`

2023-08-22 Thread via GitHub


HyukjinKwon commented on code in PR #42608:
URL: https://github.com/apache/spark/pull/42608#discussion_r1302449910


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1623,6 +1624,10 @@ def __getitem__(self, item: Union[int, str, Column, 
List, Tuple]) -> Union[Colum
 alias = self._get_alias()
 if self._plan is None:
 raise SparkConnectException("Cannot analyze on empty plan.")
+
+if "." not in item and "*" not in item and "`" not in item and 
item not in self.columns:

Review Comment:
   Should we maybe just:
   
   ```python
   col = _to_col_with_plan_id(
   col=alias if alias is not None else item,
   plan_id=self._plan._plan_id,
   )
   self.select(col)
   return col
   ```






[GitHub] [spark] ConeyLiu commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method

2023-08-22 Thread via GitHub


ConeyLiu commented on code in PR #42612:
URL: https://github.com/apache/spark/pull/42612#discussion_r1302448847


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala:
##
@@ -279,7 +282,8 @@ case class StaticInvoke(
 inputTypes: Seq[AbstractDataType] = Nil,
 propagateNull: Boolean = true,
 returnNullable: Boolean = true,
-isDeterministic: Boolean = true) extends InvokeLike {
+isDeterministic: Boolean = true,
+scalarFunction: Option[ScalarFunction[_]] = None) extends InvokeLike {

Review Comment:
   We need both `name` and `canonicalName` to build `UserDefinedScalarFunc`. So 
we need to add two extra parameters. Wouldn't it be more intuitive to use 
`ScalarFunction`?






[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`

2023-08-22 Thread via GitHub


LuciferYang commented on code in PR #42596:
URL: https://github.com/apache/spark/pull/42596#discussion_r1302446464


##
python/pyspark/sql/functions.py:
##
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: 
Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> 
Column:
-"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for 
approximate distinct count
-of column `col`.
+"""
+Applies an aggregate function to return an approximate distinct count of 
the specified column.
 
-.. versionadded:: 2.1.0
+This function returns a new :class:`~pyspark.sql.Column` that estimates 
the number of distinct
+elements in a column or a group of columns.
 
-.. versionchanged:: 3.4.0
-Supports Spark Connect.
+.. versionadded:: 2.1.0
 
 .. versionchanged:: 3.4.0
 Supports Spark Connect.
 
 Parameters
 --
 col : :class:`~pyspark.sql.Column` or str
+The label of the column to count distinct values in.
 rsd : float, optional
-maximum relative standard deviation allowed (default = 0.05).
-For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+The maximum allowed relative standard deviation (default = 0.05).
+If rsd < 0.01, it would be more efficient to use 
:func:`count_distinct`.
 
 Returns
 ---
 :class:`~pyspark.sql.Column`
-the column of computed results.
+A new Column object representing the approximate unique count.
+
+See Also
+--
+:meth:`pyspark.sql.functions.count_distinct`
 
 Examples
 
+Example 1: Counting distinct values in a single column DataFrame 
representing integers
+
+>>> from pyspark.sql.functions import approx_count_distinct
 >>> df = spark.createDataFrame([1,2,2,3], "INT")
 >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
 +---+
 |distinct_values|
 +---+
 |  3|
 +---+
+
+Example 2: Counting distinct values in a single column DataFrame 
representing strings
+
+>>> from pyspark.sql.functions import approx_count_distinct
+>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], 
"string").toDF("fruit")
+>>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
++---+
+|distinct_fruits|
++---+
+|  3|
++---+
+
+Example 3: Counting distinct values in a DataFrame with multiple columns
+
+>>> from pyspark.sql.functions import approx_count_distinct, struct
+>>> df = spark.createDataFrame([("Alice", 1),
+... ("Alice", 2),
+... ("Bob", 3),
+... ("Bob", 3)], ["name", "value"])
+>>> df = df.withColumn("combined", struct("name", "value"))
+>>> 
df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show()

Review Comment:
   done






[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`

2023-08-22 Thread via GitHub


LuciferYang commented on code in PR #42596:
URL: https://github.com/apache/spark/pull/42596#discussion_r1302445841


##
python/pyspark/sql/functions.py:
##
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: 
Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> 
Column:
-"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for 
approximate distinct count
-of column `col`.
+"""
+Applies an aggregate function to return an approximate distinct count of 
the specified column.
 
-.. versionadded:: 2.1.0
+This function returns a new :class:`~pyspark.sql.Column` that estimates 
the number of distinct
+elements in a column or a group of columns.
 
-.. versionchanged:: 3.4.0
-Supports Spark Connect.
+.. versionadded:: 2.1.0
 
 .. versionchanged:: 3.4.0
 Supports Spark Connect.
 
 Parameters
 --
 col : :class:`~pyspark.sql.Column` or str
+The label of the column to count distinct values in.
 rsd : float, optional
-maximum relative standard deviation allowed (default = 0.05).
-For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+The maximum allowed relative standard deviation (default = 0.05).
+If rsd < 0.01, it would be more efficient to use 
:func:`count_distinct`.
 
 Returns
 ---
 :class:`~pyspark.sql.Column`
-the column of computed results.
+A new Column object representing the approximate unique count.
+
+See Also
+--
+:meth:`pyspark.sql.functions.count_distinct`
 
 Examples
 
+Example 1: Counting distinct values in a single column DataFrame 
representing integers
+
+>>> from pyspark.sql.functions import approx_count_distinct
 >>> df = spark.createDataFrame([1,2,2,3], "INT")

Review Comment:
   done






[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`

2023-08-22 Thread via GitHub


LuciferYang commented on code in PR #42596:
URL: https://github.com/apache/spark/pull/42596#discussion_r1302445423


##
python/pyspark/sql/functions.py:
##
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: 
Optional[float] = None) -> Col
 
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> 
Column:
-"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for 
approximate distinct count
-of column `col`.
+"""
+Applies an aggregate function to return an approximate distinct count of 
the specified column.
 
-.. versionadded:: 2.1.0
+This function returns a new :class:`~pyspark.sql.Column` that estimates 
the number of distinct
+elements in a column or a group of columns.
 
-.. versionchanged:: 3.4.0
-Supports Spark Connect.
+.. versionadded:: 2.1.0
 
 .. versionchanged:: 3.4.0
 Supports Spark Connect.
 
 Parameters
 --
 col : :class:`~pyspark.sql.Column` or str
+The label of the column to count distinct values in.
 rsd : float, optional
-maximum relative standard deviation allowed (default = 0.05).
-For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+The maximum allowed relative standard deviation (default = 0.05).
+If rsd < 0.01, it would be more efficient to use 
:func:`count_distinct`.
 
 Returns
 ---
 :class:`~pyspark.sql.Column`
-the column of computed results.
+A new Column object representing the approximate unique count.
+
+See Also
+--
+:meth:`pyspark.sql.functions.count_distinct`
 
 Examples
 
+Example 1: Counting distinct values in a single column DataFrame 
representing integers
+
+>>> from pyspark.sql.functions import approx_count_distinct
 >>> df = spark.createDataFrame([1,2,2,3], "INT")
 >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
 +---+
 |distinct_values|
 +---+
 |  3|
 +---+
+
+Example 2: Counting distinct values in a single column DataFrame 
representing strings
+
+>>> from pyspark.sql.functions import approx_count_distinct
+>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], 
"string").toDF("fruit")

Review Comment:
   like `spark.createDataFrame([("apple"), ("orange"), ("apple"), ("banana")], 
['fruit'])`?






[GitHub] [spark] dongjoon-hyun commented on pull request #42624: [SPARK-44925][K8S] K8s default service token file should not be materialized into token

2023-08-22 Thread via GitHub


dongjoon-hyun commented on PR #42624:
URL: https://github.com/apache/spark/pull/42624#issuecomment-1689229669

   Thank you so much, @yaooqinn !





[GitHub] [spark] yaooqinn commented on a diff in pull request #42549: [SPARK-44860][SQL] Add SESSION_USER function

2023-08-22 Thread via GitHub


yaooqinn commented on code in PR #42549:
URL: https://github.com/apache/spark/pull/42549#discussion_r1302442801


##
sql/core/src/test/scala/org/apache/spark/sql/MiscFunctionsSuite.scala:
##
@@ -48,15 +48,16 @@ class MiscFunctionsSuite extends QueryTest with 
SharedSparkSession {
 checkAnswer(df.selectExpr("version()"), df.select(version()))
   }
 
-  test("SPARK-21957: get current_user in normal spark apps") {
+  test("SPARK-21957, SPARK-44860: get current_user, session_user in normal 
spark apps") {

Review Comment:
   Can we also update `SPARK-21957: get current_user through thrift server`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on a diff in pull request #42549: [SPARK-44860][SQL] Add SESSION_USER function

2023-08-22 Thread via GitHub


yaooqinn commented on code in PR #42549:
URL: https://github.com/apache/spark/pull/42549#discussion_r1302442192


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##
@@ -95,12 +95,13 @@ trait ColumnResolutionHelper extends Logging {
 }
   }
 
-// support CURRENT_DATE, CURRENT_TIMESTAMP, and grouping__id
+  // support CURRENT_DATE, CURRENT_TIMESTAMP, CURRENT_USER, SESSION_USER and 
grouping__id

Review Comment:
   ```suggestion
 // support CURRENT_DATE, CURRENT_TIMESTAMP, CURRENT_USER, USER, 
SESSION_USER and grouping__id
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method

2023-08-22 Thread via GitHub


beliefer commented on code in PR #42612:
URL: https://github.com/apache/spark/pull/42612#discussion_r1302441432


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala:
##
@@ -279,7 +282,8 @@ case class StaticInvoke(
 inputTypes: Seq[AbstractDataType] = Nil,
 propagateNull: Boolean = true,
 returnNullable: Boolean = true,
-isDeterministic: Boolean = true) extends InvokeLike {
+isDeterministic: Boolean = true,
+scalarFunction: Option[ScalarFunction[_]] = None) extends InvokeLike {

Review Comment:
   It seems you just try to get the name of scalar function, shall we only add 
the name parameters ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zwangsheng commented on pull request #42600: [SPARK-44906][K8S] Make `Kubernetes[Driver|Executor]Conf.annotations` substitute annotations instead of feature steps

2023-08-22 Thread via GitHub


zwangsheng commented on PR #42600:
URL: https://github.com/apache/spark/pull/42600#issuecomment-1689213258

   Thanks for your review @dongjoon-hyun! 
   
   Addressed, let me know if anything is missing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun opened a new pull request, #42624: [SPARK-44925][K8S] K8s default service token file should not be materialized into token

2023-08-22 Thread via GitHub


dongjoon-hyun opened a new pull request, #42624:
URL: https://github.com/apache/spark/pull/42624

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ragnarok56 opened a new pull request, #42623: [SPARK-44924][SS] Add config for FileStreamSource cached files

2023-08-22 Thread via GitHub


ragnarok56 opened a new pull request, #42623:
URL: https://github.com/apache/spark/pull/42623

   ### What changes were proposed in this pull request?
   This change adds configuration options for the streaming input File Source 
for `maxCachedFiles` and `discardCachedFilesRatio`.  These values were 
originally introduced with https://github.com/apache/spark/pull/27620 but were 
hardcoded to 10,000 and 0.2, respectively.
   
   ### Why are the changes needed?
   Under certain workloads with large `maxFilesPerTrigger` settings, the 
performance gain from caching the input files capped at 10,000 can cause a 
cluster to be underutilized and jobs to take longer to finish if each batch 
takes a while to finish.  For example, a job with `maxFilesPerTrigger` set to 
100,000 would do all 100k in batch 1, then only 10k in batch 2, but both 
batches could take just as long since some of the files cause skewed processing 
times.  This results in a cluster spending nearly the same amount of time while 
processing only 1/10 of the files it could have.
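   
   As a rough illustration of how the new knobs could be used (a sketch; the option names come from this PR, and whether they are exposed as per-source read options is an assumption):
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.appName("file-source-cache-sketch").getOrCreate()
   
   stream = (
       spark.readStream.format("json")
       .schema("id LONG, payload STRING")       # illustrative schema
       .option("maxFilesPerTrigger", 100000)    # large trigger batches
       .option("maxCachedFiles", 100000)        # raise the previously hardcoded 10,000 cap
       .option("discardCachedFilesRatio", 0.2)  # keep the original discard ratio
       .load("/data/landing")                   # illustrative path
   )
   ```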
   
   ### Does this PR introduce _any_ user-facing change?
   Updated documentation for structured streaming sources to describe the new configuration options.
   
   ### How was this patch tested?
   New and existing unit tests. 
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on pull request #42622: [SPARK-44923][PYTHON][DOCS] Some directories should be cleared when regenerating files

2023-08-22 Thread via GitHub


panbingkun commented on PR #42622:
URL: https://github.com/apache/spark/pull/42622#issuecomment-1689197410

   > in what case will this be a problem?
   > I don't see similar doc build failures in CI of master and branch-3.5
   
   1. The docs files have already been generated locally.
   2. At this point, the PySpark code has changed, for example the function `chr` was deleted.
   3. Run the command to regenerate the docs locally; an error occurs at this point.
   4. This doesn't happen on GA because GA always generates the docs from scratch.
   
   In `conf.py`, the directories `source/reference/api` and `reference/pyspark.pandas/api` are already cleaned up; for consistency we should do the same for `reference/pyspark.sql/api` and `reference/pyspark.ss/api`.
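   
   A minimal sketch (not the PR's actual code) of the kind of pre-build cleanup described above, assuming it lives in `python/docs/source/conf.py` and that these are the generated directories to clear:
   
   ```python
   import os
   import shutil
   
   # Generated API stub directories; removing them before each build ensures that
   # deleted functions do not leave stale .rst files behind.
   _generated_api_dirs = [
       "reference/api",
       "reference/pyspark.pandas/api",
       "reference/pyspark.sql/api",
       "reference/pyspark.ss/api",
   ]
   
   for _d in _generated_api_dirs:
       _path = os.path.join(os.path.dirname(os.path.abspath(__file__)), _d)
       if os.path.isdir(_path):
           shutil.rmtree(_path)
   ```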


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on a diff in pull request #41785: [SPARK-44241][Core] Mistakenly set io.connectionTimeout/connectionCreationTimeout to zero or negative will cause incessant executo

2023-08-22 Thread via GitHub


yaooqinn commented on code in PR #41785:
URL: https://github.com/apache/spark/pull/41785#discussion_r1302410752


##
common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java:
##
@@ -276,10 +277,19 @@ public void initChannel(SocketChannel ch) {
 // Connect to the remote server
 long preConnect = System.nanoTime();
 ChannelFuture cf = bootstrap.connect(address);
-if (!cf.await(conf.connectionCreationTimeoutMs())) {
+
+if (connCreateTimeout <= 0) {
+  cf.awaitUninterruptibly();

Review Comment:
   Thank you @JoshRosen. The fix looks good to me.
   
   While considering the usage of `await` and `awaitUninterruptibly`, I 
consulted the Netty documentation 
(https://netty.io/4.0/api/io/netty/channel/ChannelFuture.html) which 
recommended the use of `awaitUninterruptibly` in all its GOOD examples. So...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #42608: [SPARK-42017][PYTHON][CONNECT][TESTS] Enable `ColumnParityTests.test_access_column`

2023-08-22 Thread via GitHub


zhengruifeng commented on code in PR #42608:
URL: https://github.com/apache/spark/pull/42608#discussion_r1302385936


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1623,6 +1624,10 @@ def __getitem__(self, item: Union[int, str, Column, 
List, Tuple]) -> Union[Colum
 alias = self._get_alias()
 if self._plan is None:
 raise SparkConnectException("Cannot analyze on empty plan.")
+
+if "." not in item and "*" not in item and "`" not in item and 
item not in self.columns:

Review Comment:
   I'm now hesitant about whether this is worthwhile @HyukjinKwon 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #42622: [SPARK-44923][PYTHON][DOCS] Some directories should be cleared when regenerating files

2023-08-22 Thread via GitHub


zhengruifeng commented on PR #42622:
URL: https://github.com/apache/spark/pull/42622#issuecomment-1689179825

   @panbingkun in what case will this be a problem?
   I don't see similar doc build failures in CI of master and branch-3.5


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on pull request #42601: [SPARK-44905][SQL] Stateful lastRegex causes NullPointerException on eval for regexp_replace

2023-08-22 Thread via GitHub


yaooqinn commented on PR #42601:
URL: https://github.com/apache/spark/pull/42601#issuecomment-1689173818

   thanks @cloud-fan, merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on pull request #42622: [SPARK-44923][PYTHON][DOCS] Some directories should be cleared when regenerating files

2023-08-22 Thread via GitHub


panbingkun commented on PR #42622:
URL: https://github.com/apache/spark/pull/42622#issuecomment-1689173498

   cc @zhengruifeng @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on pull request #42622: [SPARK-44923][PYTHON][DOCS] Some directories should be cleared when regenerating files

2023-08-22 Thread via GitHub


panbingkun commented on PR #42622:
URL: https://github.com/apache/spark/pull/42622#issuecomment-1689173078

   As shown in the following figure, during the document generation process of 
`sphinx-build`, some directories and files will be automatically generated in 
the directory: 
   [screenshot: https://github.com/apache/spark/assets/15246973/c679c9d3-a010-4884-8f96-f56bc5fcde4c]
   [screenshot: https://github.com/apache/spark/assets/15246973/d04b4330-8784-46a3-9a58-cb27b75dcc97]


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn closed pull request #42601: [SPARK-44905][SQL] Stateful lastRegex causes NullPointerException on eval for regexp_replace

2023-08-22 Thread via GitHub


yaooqinn closed pull request #42601: [SPARK-44905][SQL] Stateful lastRegex 
causes NullPointerException on eval for regexp_replace
URL: https://github.com/apache/spark/pull/42601


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun opened a new pull request, #42622: [SPARK-44923][PYTHON][DOCS] Some directories should be cleared when regenerating files

2023-08-22 Thread via GitHub


panbingkun opened a new pull request, #42622:
URL: https://github.com/apache/spark/pull/42622

   ### What changes were proposed in this pull request?
   The PR aims to fix a bug in regenerating the PySpark docs in certain scenarios.
   
   ### Why are the changes needed?
   - The following error occurred while I was regenerating the PySpark docs.
     [screenshot: https://github.com/apache/spark/assets/15246973/548abd63-4349-4267-b1fe-a293bd1e7f3e]
   
   - We can reproduce this problem as follows:
     1. `git reset --hard 3f380b9ecc8b27f6965b554061572e0990f0513`
        [screenshot: https://github.com/apache/spark/assets/15246973/5ab9c8fc-5835-4ced-8d92-9d5e020b262a]
     2. `make clean html`; at this point it succeeds.
        [screenshot: https://github.com/apache/spark/assets/15246973/5c3ce07f-cbe8-4177-ae22-b16c3fc62e01]
     3. `git pull`
     4. `make clean html`; at this point it fails.
        [screenshot: https://github.com/apache/spark/assets/15246973/548abd63-4349-4267-b1fe-a293bd1e7f3e]
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   1.Pass GA.
   2.Manually test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on pull request #42614: [SPARK-44922][TESTS] Disable o.a.p.h.InternalParquetRecordWriter logs for tests

2023-08-22 Thread via GitHub


yaooqinn commented on PR #42614:
URL: https://github.com/apache/spark/pull/42614#issuecomment-1689169814

   Hi @dongjoon-hyun @HyukjinKwon, thanks for the suggestion. The associated JIRA ticket has been attached.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #42621: [SPARK-43567][FOLLOWUP] Missing backtick from migration guide

2023-08-22 Thread via GitHub


HyukjinKwon closed pull request #42621: [SPARK-43567][FOLLOWUP] Missing 
backtick from migration guide
URL: https://github.com/apache/spark/pull/42621


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #42621: [SPARK-43567][FOLLOWUP] Missing backtick from migration guide

2023-08-22 Thread via GitHub


HyukjinKwon commented on PR #42621:
URL: https://github.com/apache/spark/pull/42621#issuecomment-1689170414

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #42428: [SPARK-44742][PYTHON][DOCS] Add Spark version drop down to the PySpark doc site

2023-08-22 Thread via GitHub


HyukjinKwon closed pull request #42428: [SPARK-44742][PYTHON][DOCS] Add Spark 
version drop down to the PySpark doc site
URL: https://github.com/apache/spark/pull/42428


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #42428: [SPARK-44742][PYTHON][DOCS] Add Spark version drop down to the PySpark doc site

2023-08-22 Thread via GitHub


HyukjinKwon commented on PR #42428:
URL: https://github.com/apache/spark/pull/42428#issuecomment-1689168915

   Alright, let's see how it goes.
   
   Merged to master and branch-3.5.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic opened a new pull request, #42621: [SPARK-43567][FOLLOWUP] Missing backtick from migration guide

2023-08-22 Thread via GitHub


itholic opened a new pull request, #42621:
URL: https://github.com/apache/spark/pull/42621

   
   
   ### What changes were proposed in this pull request?
   
   This is a followup for https://github.com/apache/spark/pull/42270.
   
   
   ### Why are the changes needed?
   
   
   To render the migration guide properly.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No, it's documentation fix.
   
   ### How was this patch tested?
   
   
   The existing CI should pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #42594: [SPARK-44839][SS][CONNECT] Better Error Logging when user tries to serialize spark session

2023-08-22 Thread via GitHub


itholic commented on code in PR #42594:
URL: https://github.com/apache/spark/pull/42594#discussion_r1302374470


##
python/pyspark/sql/connect/streaming/query.py:
##
@@ -237,7 +239,13 @@ def addListener(self, listener: StreamingQueryListener) -> 
None:
 listener._init_listener_id()
 cmd = pb2.StreamingQueryManagerCommand()
 expr = proto.PythonUDF()
-expr.command = CloudPickleSerializer().dumps(listener)
+try:
+expr.command = CloudPickleSerializer().dumps(listener)
+except pickle.PicklingError:
+raise PySparkRuntimeError(

Review Comment:
   > define a new PySparkPicklingError and replace this PySparkRuntimeError 
with that right?
   
   Correct :-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on a diff in pull request #42617: [SPARK-44918][SQL][PYTHON] Support named arguments in scalar Python/Pandas UDFs

2023-08-22 Thread via GitHub


ueshin commented on code in PR #42617:
URL: https://github.com/apache/spark/pull/42617#discussion_r1302373707


##
sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala:
##
@@ -50,6 +50,19 @@ case class UserDefinedPythonFunction(
 udfDeterministic: Boolean) {
 
   def builder(e: Seq[Expression]): Expression = {
+if (pythonEvalType == PythonEvalType.SQL_BATCHED_UDF
+|| pythonEvalType ==PythonEvalType.SQL_ARROW_BATCHED_UDF
+|| pythonEvalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF) {
+  /*
+   * Check if the named arguments:
+   * - don't have duplicated names
+   * - don't contain positional arguments

Review Comment:
   The third item is not right for Python UDF/UDTF. Currently it relies on 
Python.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40932: [SPARK-43266][SQL] Move MergeScalarSubqueries to spark-sql

2023-08-22 Thread via GitHub


beliefer commented on PR #40932:
URL: https://github.com/apache/spark/pull/40932#issuecomment-1689157078

   @peter-toth Got it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #42608: [SPARK-42017][PYTHON][CONNECT][TESTS] Enable `ColumnParityTests.test_access_column`

2023-08-22 Thread via GitHub


zhengruifeng commented on code in PR #42608:
URL: https://github.com/apache/spark/pull/42608#discussion_r1302368571


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1623,6 +1624,10 @@ def __getitem__(self, item: Union[int, str, Column, 
List, Tuple]) -> Union[Colum
 alias = self._get_alias()
 if self._plan is None:
 raise SparkConnectException("Cannot analyze on empty plan.")
+
+if "." not in item and "*" not in item and "`" not in item and 
item not in self.columns:

Review Comment:
   similar to 
https://github.com/apache/spark/commit/91e97f92fe76f9718cd16af0c761d5530bdb37ee,
 but need to skip cases like `*`, `a.*`, ``aa.bb``



##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1623,6 +1624,10 @@ def __getitem__(self, item: Union[int, str, Column, 
List, Tuple]) -> Union[Colum
 alias = self._get_alias()
 if self._plan is None:
 raise SparkConnectException("Cannot analyze on empty plan.")
+
+if "." not in item and "*" not in item and "`" not in item and 
item not in self.columns:

Review Comment:
   similar to 
https://github.com/apache/spark/commit/91e97f92fe76f9718cd16af0c761d5530bdb37ee,
 but need to skip cases like `*`, `a.*`, ``aa.bb``
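   
   For reference, a toy illustration (not from the PR) of the cases named above that a plain `item not in self.columns` membership test must not reject, assuming an active `spark` session:
   
   ```python
   from pyspark.sql import Row
   
   df = spark.createDataFrame([Row(id=1, s=Row(a=2, b=3))])
   df_dotted = df.withColumnRenamed("id", "aa.bb")   # a column name containing a literal dot
   
   df.select(df["*"])            # "*" expands all columns and is never in df.columns
   df.select(df["s.a"])          # dotted path into a struct field
   df_dotted.select("`aa.bb`")   # backquoted reference to the literally dotted name
   ```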



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #42618: [SPARK-44919] Avro connector: convert a union of a single primitive type to a StructType

2023-08-22 Thread via GitHub


sadikovi commented on code in PR #42618:
URL: https://github.com/apache/spark/pull/42618#discussion_r1302363525


##
connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala:
##
@@ -142,18 +143,30 @@ object SchemaConverters {
 if (avroSchema.getTypes.asScala.exists(_.getType == NULL)) {
   // In case of a union with null, eliminate it and make a recursive 
call
   val remainingUnionTypes = AvroUtils.nonNullUnionBranches(avroSchema)
-  if (remainingUnionTypes.size == 1) {
-toSqlTypeHelper(remainingUnionTypes.head, existingRecordNames, 
avroOptions)
-  .copy(nullable = true)
-  } else {
-toSqlTypeHelper(
-  Schema.createUnion(remainingUnionTypes.asJava),
-  existingRecordNames,
-  avroOptions).copy(nullable = true)
-  }
+  toSqlTypeHelper(
+Schema.createUnion(remainingUnionTypes.asJava),
+existingRecordNames,
+avroOptions).copy(nullable = true)
 } else avroSchema.getTypes.asScala.map(_.getType).toSeq match {
   case Seq(t1) =>
-toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, 
avroOptions)
+// If spark.sql.avro.alwaysConvertUnionToStructType is set to 
false (default),
+// we convert Avro union with a single primitive type into a 
primitive Spark type
+// instead of a StructType.
+if (!SQLConf.get.avroAlwaysConvertUnionToStruct) {
+  toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, 
avroOptions)

Review Comment:
   Maybe it is good to compare the two approaches and see which one makes sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #42618: [SPARK-44919] Avro connector: convert a union of a single primitive type to a StructType

2023-08-22 Thread via GitHub


sadikovi commented on code in PR #42618:
URL: https://github.com/apache/spark/pull/42618#discussion_r1302362345


##
connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala:
##
@@ -142,18 +143,30 @@ object SchemaConverters {
 if (avroSchema.getTypes.asScala.exists(_.getType == NULL)) {
   // In case of a union with null, eliminate it and make a recursive 
call
   val remainingUnionTypes = AvroUtils.nonNullUnionBranches(avroSchema)
-  if (remainingUnionTypes.size == 1) {
-toSqlTypeHelper(remainingUnionTypes.head, existingRecordNames, 
avroOptions)
-  .copy(nullable = true)
-  } else {
-toSqlTypeHelper(
-  Schema.createUnion(remainingUnionTypes.asJava),
-  existingRecordNames,
-  avroOptions).copy(nullable = true)
-  }
+  toSqlTypeHelper(
+Schema.createUnion(remainingUnionTypes.asJava),
+existingRecordNames,
+avroOptions).copy(nullable = true)
 } else avroSchema.getTypes.asScala.map(_.getType).toSeq match {
   case Seq(t1) =>
-toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, 
avroOptions)
+// If spark.sql.avro.alwaysConvertUnionToStructType is set to 
false (default),
+// we convert Avro union with a single primitive type into a 
primitive Spark type
+// instead of a StructType.
+if (!SQLConf.get.avroAlwaysConvertUnionToStruct) {
+  toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, 
avroOptions)

Review Comment:
   I was going to suggest doing this for Avro schema converter in the Avro data 
source where this is required because the top schema is always expected to be a 
struct.
   
   For example, we change this code in AvroUtils: 
   ```scala
   SchemaConverters.toSqlType(avroSchema, options).dataType match {
 case t: StructType => Some(t)
 case _ => throw new RuntimeException(
   s"""Avro schema cannot be converted to a Spark SQL StructType:
  |
  |${avroSchema.toString(true)}
  |""".stripMargin)
   }
   ```
   
   to something like this: 
   ```scala
   SchemaConverters.toSqlType(avroSchema, options).dataType match {
 case t: StructType => Some(t)
  case t: AtomicType => Some(StructType(Seq(StructField("value", t, nullable = true))))
 case _ => throw new RuntimeException(
   s"""Avro schema cannot be converted to a Spark SQL StructType:
  |
  |${avroSchema.toString(true)}
  |""".stripMargin)
   }
   ```
   
   Can we also check that this change will not affect `from_avro` SQL function? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sadikovi commented on a diff in pull request #42618: [SPARK-44919] Avro connector: convert a union of a single primitive type to a StructType

2023-08-22 Thread via GitHub


sadikovi commented on code in PR #42618:
URL: https://github.com/apache/spark/pull/42618#discussion_r1302356178


##
connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala:
##
@@ -142,18 +143,30 @@ object SchemaConverters {
 if (avroSchema.getTypes.asScala.exists(_.getType == NULL)) {
   // In case of a union with null, eliminate it and make a recursive 
call
   val remainingUnionTypes = AvroUtils.nonNullUnionBranches(avroSchema)
-  if (remainingUnionTypes.size == 1) {
-toSqlTypeHelper(remainingUnionTypes.head, existingRecordNames, 
avroOptions)
-  .copy(nullable = true)
-  } else {
-toSqlTypeHelper(
-  Schema.createUnion(remainingUnionTypes.asJava),
-  existingRecordNames,
-  avroOptions).copy(nullable = true)
-  }
+  toSqlTypeHelper(
+Schema.createUnion(remainingUnionTypes.asJava),
+existingRecordNames,
+avroOptions).copy(nullable = true)
 } else avroSchema.getTypes.asScala.map(_.getType).toSeq match {
   case Seq(t1) =>
-toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, 
avroOptions)
+// If spark.sql.avro.alwaysConvertUnionToStructType is set to 
false (default),
+// we convert Avro union with a single primitive type into a 
primitive Spark type
+// instead of a StructType.
+if (!SQLConf.get.avroAlwaysConvertUnionToStruct) {
+  toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, 
avroOptions)
+} else {
+  val singleton = avroSchema.getTypes.get(0)
+  val schemaType = toSqlTypeHelper(singleton, existingRecordNames, 
avroOptions)
+  val fieldName = if (avroOptions.useStableIdForUnionType) {
+s"member_${singleton.getName.toLowerCase(Locale.ROOT)}"
+  } else {
+s"member0"

Review Comment:
   Could you explore if there is a way to not duplicate the logic of stable 
identifiers? 



##
connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala:
##
@@ -142,18 +143,30 @@ object SchemaConverters {
 if (avroSchema.getTypes.asScala.exists(_.getType == NULL)) {
   // In case of a union with null, eliminate it and make a recursive 
call
   val remainingUnionTypes = AvroUtils.nonNullUnionBranches(avroSchema)
-  if (remainingUnionTypes.size == 1) {
-toSqlTypeHelper(remainingUnionTypes.head, existingRecordNames, 
avroOptions)
-  .copy(nullable = true)
-  } else {
-toSqlTypeHelper(
-  Schema.createUnion(remainingUnionTypes.asJava),
-  existingRecordNames,
-  avroOptions).copy(nullable = true)
-  }
+  toSqlTypeHelper(
+Schema.createUnion(remainingUnionTypes.asJava),
+existingRecordNames,
+avroOptions).copy(nullable = true)
 } else avroSchema.getTypes.asScala.map(_.getType).toSeq match {
   case Seq(t1) =>
-toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, 
avroOptions)
+// If spark.sql.avro.alwaysConvertUnionToStructType is set to 
false (default),
+// we convert Avro union with a single primitive type into a 
primitive Spark type
+// instead of a StructType.
+if (!SQLConf.get.avroAlwaysConvertUnionToStruct) {
+  toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, 
avroOptions)

Review Comment:
   I was going to suggest doing this for Avro schema converter in the Avro data 
source where this is required because the top schema is always expected to be a 
struct.
   
   For example, we change this code in AvroUtils: 
   ```scala
   // Converts Avro schema to sql type and ensures that the top level data type 
is either an Avro
   // record or a complex union that both result in a conversion to StructType
   def convertAvroToSqlSchema(avroSchema: Schema, avroOptions: AvroOptions): 
StructType = {
 SchemaConverters.toSqlTypeHelper(avroSchema, Set.empty, 
avroOptions).dataType match {
   case t: StructType => t
   case _ => throw new RuntimeException(
 s"""Avro schema cannot be converted to a Spark SQL StructType:
|
|${avroSchema.toString(true)}
|""".stripMargin)
 }
   }
   ```
   
   to something like this: 
   ```scala
   // Converts Avro schema to sql type and ensures that the top level data type 
is either an Avro
   // record or a complex union that both result in a conversion to StructType
   def convertAvroToSqlSchema(avroSchema: Schema, avroOptions: AvroOptions): 
StructType = {
 SchemaConverters.toSqlTypeHelper(avroSchema, Set.empty, 
avroOptions).dataType match {
   case t: StructType => t
   case t: AtomicType => 

[GitHub] [spark] Hisoka-X commented on a diff in pull request #42194: [SPARK-41471][SQL] Reduce Spark shuffle when only one side of a join is KeyGroupedPartitioning

2023-08-22 Thread via GitHub


Hisoka-X commented on code in PR #42194:
URL: https://github.com/apache/spark/pull/42194#discussion_r1302357936


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala:
##
@@ -182,7 +182,16 @@ case class BatchScanExec(
 
   // Now fill missing partition keys with empty partitions
   val partitionMapping = nestGroupedPartitions.toMap
-  finalPartitions = spjParams.commonPartitionValues.get.flatMap {
+
+  // SPARK-41471: We keep to order of partition keys in 
`commonPartitionValues` to

Review Comment:
   Thanks for the advice! Addressed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 commented on a diff in pull request #42599: [DO-NOT-MERGE] Remove Guava from shared classes from IsolatedClientLoader

2023-08-22 Thread via GitHub


pan3793 commented on code in PR #42599:
URL: https://github.com/apache/spark/pull/42599#discussion_r1302355574


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala:
##
@@ -130,8 +130,7 @@ private[hive] object IsolatedClientLoader extends Logging {
 }
 val hiveArtifacts = version.extraDeps ++
   Seq("hive-metastore", "hive-exec", "hive-common", "hive-serde")
-.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++
-  Seq("com.google.guava:guava:14.0.1") ++ hadoopJarNames
+.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ 
hadoopJarNames

Review Comment:
   > if we remove Guava from shared classes but also changed the now-non-shared 
metastore Guava version then it's the same net effect and breaks older Hive 
versions.
   
   Hmm, I don't get the idea. After we remove Guava from the shared class, we 
don't need to keep Hive and Spark using the same Guava. And we did not add 
Guava to `HiveVersion#exclusions`, each Hive version should automatically pull 
all transitive dependencies including Guava if necessary.
   
   Anyway, let's simply try (a) first.



##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala:
##
@@ -130,8 +130,7 @@ private[hive] object IsolatedClientLoader extends Logging {
 }
 val hiveArtifacts = version.extraDeps ++
   Seq("hive-metastore", "hive-exec", "hive-common", "hive-serde")
-.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++
-  Seq("com.google.guava:guava:14.0.1") ++ hadoopJarNames
+.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ 
hadoopJarNames

Review Comment:
   > if we remove Guava from shared classes but also changed the now-non-shared 
metastore Guava version then it's the same net effect and breaks older Hive 
versions.
   
   Hmm, I don't get the idea. After removing Guava from the shared class, we 
don't need to keep Hive and Spark using the same Guava. And we did not add 
Guava to `HiveVersion#exclusions`, each Hive version should automatically pull 
all transitive dependencies including Guava if necessary.
   
   Anyway, let's simply try (a) first.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 commented on a diff in pull request #42599: [DO-NOT-MERGE] Remove Guava from shared classes from IsolatedClientLoader

2023-08-22 Thread via GitHub


pan3793 commented on code in PR #42599:
URL: https://github.com/apache/spark/pull/42599#discussion_r1302355574


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala:
##
@@ -130,8 +130,7 @@ private[hive] object IsolatedClientLoader extends Logging {
 }
 val hiveArtifacts = version.extraDeps ++
   Seq("hive-metastore", "hive-exec", "hive-common", "hive-serde")
-.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++
-  Seq("com.google.guava:guava:14.0.1") ++ hadoopJarNames
+.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ 
hadoopJarNames

Review Comment:
   > if we remove Guava from shared classes but also changed the now-non-shared 
metastore Guava version then it's the same net effect and breaks older Hive 
versions.
   
   Hmm, I don't get the idea. After we remove Guava from the shared class, then 
we don't need to keep Hive and Spark using the same Guava. And we did not add 
Guava to `HiveVersion#exclusions`, each Hive version should automatically pull 
all transitive dependencies including Guava if necessary.
   
   Anyway, let's simply try (a) first.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on pull request #42513: [SPARK-44827][PYTHON][TESTS] Fix test when ansi mode enabled

2023-08-22 Thread via GitHub


panbingkun commented on PR #42513:
URL: https://github.com/apache/spark/pull/42513#issuecomment-1689111917

   All done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #42620: [SPARK-44921][SQL] Remove SqlBaseLexer.tokens from codebase

2023-08-22 Thread via GitHub


amaliujia commented on PR #42620:
URL: https://github.com/apache/spark/pull/42620#issuecomment-1689108560

   @cloud-fan @hvanhovell 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia opened a new pull request, #42620: [SPARK-44921][SQL] Remove SqlBaseLexer.tokens from codebase

2023-08-22 Thread via GitHub


amaliujia opened a new pull request, #42620:
URL: https://github.com/apache/spark/pull/42620

   
   
   ### What changes were proposed in this pull request?
   
   
https://github.com/apache/spark/commit/8ff6b7a04cbaef9c552789ad5550ceab760cb078#diff-f4df4ce19570230091c3b2432e3c84cd2db7059c7b2a03213d272094bd940454
 refactors antlr4 files to `sql/api` but checked in `SqlBaseLexer.tokens`. 
   
   This file is generated so we do not need to check it in.
   
   ### Why are the changes needed?
   
   Remove a file that does not need to be checked in.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Existing Test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JoshRosen commented on a diff in pull request #41785: [SPARK-44241][Core] Mistakenly set io.connectionTimeout/connectionCreationTimeout to zero or negative will cause incessant execut

2023-08-22 Thread via GitHub


JoshRosen commented on code in PR #41785:
URL: https://github.com/apache/spark/pull/41785#discussion_r1302331832


##
common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java:
##
@@ -276,10 +277,19 @@ public void initChannel(SocketChannel ch) {
 // Connect to the remote server
 long preConnect = System.nanoTime();
 ChannelFuture cf = bootstrap.connect(address);
-if (!cf.await(conf.connectionCreationTimeoutMs())) {
+
+if (connCreateTimeout <= 0) {
+  cf.awaitUninterruptibly();

Review Comment:
   Filed https://issues.apache.org/jira/browse/SPARK-44920 and opened 
https://github.com/apache/spark/pull/42619 to fix this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JoshRosen opened a new pull request, #42619: [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()

2023-08-22 Thread via GitHub


JoshRosen opened a new pull request, #42619:
URL: https://github.com/apache/spark/pull/42619

   ### What changes were proposed in this pull request?
   
   https://github.com/apache/spark/pull/41785 / SPARK-44241 introduced a new 
`awaitUninterruptibly()` call in one branch of 
`TransportClientFactory.createClient()` (executed when the connection creation timeout is non-positive). This PR replaces that call with an interruptible 
`await()` call.
   
   Note that the other pre-existing branches in this method were already using 
`await()`.
   
   ### Why are the changes needed?
   
   Uninterruptible waiting can cause problems when cancelling tasks. For 
details, see https://github.com/apache/spark/pull/16866 / SPARK-19529, an older 
PR fixing a similar issue in this same `TransportClientFactory.createClient()` 
method.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #42578: [SPARK-44841][FOLLOWUP] Add migration guide for the behavior change

2023-08-22 Thread via GitHub


itholic commented on PR #42578:
URL: https://github.com/apache/spark/pull/42578#issuecomment-1689101755

   Yeah, at least we have to fix the typo "propotion" to "proportion". Otherwise I don't think it's absolutely necessary, but please feel free to make a change. I have no strong opinion as long as it doesn't include wrong information.
   
   Thanks for catching the typo, @bjornjorgensen !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on a diff in pull request #42617: [SPARK-44918][SQL][PYTHON] Support named arguments in scalar Python/Pandas UDFs

2023-08-22 Thread via GitHub


dtenedor commented on code in PR #42617:
URL: https://github.com/apache/spark/pull/42617#discussion_r1302322467


##
python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py:
##
@@ -1467,6 +1467,96 @@ def udf(x):
 finally:
 shutil.rmtree(path)
 
+def test_named_arguments(self):
+@pandas_udf("int")
+def test_udf(a, b):
+return a + 10 * b
+
+self.spark.udf.register("test_udf", test_udf)
+
+for i, df in enumerate(
+[
+self.spark.range(2).select(test_udf(a=col("id"), b=col("id") * 
10)),
+self.spark.range(2).select(test_udf(b=col("id") * 10, 
a=col("id"))),
+self.spark.sql("SELECT test_udf(a => id, b => id * 10) FROM 
range(2)"),

Review Comment:
   can we also have a test case with a positional argument first and then a 
named argument after? We can have a positive version of this test case where 
the positional argument maps to the first argument of the UDF and the named 
argument refers to the second, and another negative version where the named 
argument uses the same name as the first (positional) argument? Same for 
`test_udf.py`?
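   
   A hedged sketch (not from the PR) of those extra cases, reusing `test_udf(a, b)` from the diff above and assuming the test-class context (`self.spark`); the exact error type for the negative case is an assumption:
   
   ```python
   from pyspark.errors import AnalysisException
   from pyspark.sql.functions import col
   
   # Positive: a positional argument followed by a named one.
   self.spark.range(2).select(test_udf(col("id"), b=col("id") * 10)).collect()
   self.spark.sql("SELECT test_udf(id, b => id * 10) FROM range(2)").collect()
   
   # Negative: the named argument reuses the name already bound positionally,
   # which should be rejected during analysis.
   with self.assertRaises(AnalysisException):
       self.spark.sql("SELECT test_udf(id, a => id * 10) FROM range(2)").collect()
   ```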



##
sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala:
##
@@ -50,6 +50,19 @@ case class UserDefinedPythonFunction(
 udfDeterministic: Boolean) {
 
   def builder(e: Seq[Expression]): Expression = {
+if (pythonEvalType == PythonEvalType.SQL_BATCHED_UDF
+|| pythonEvalType ==PythonEvalType.SQL_ARROW_BATCHED_UDF
+|| pythonEvalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF) {
+  /*
+   * Check if the named arguments:
+   * - don't have duplicated names
+   * - don't contain positional arguments

Review Comment:
   ```suggestion
  * - don't contain positional arguments after named arguments
  * - all map to valid argument names from the function declaration
   ```



##
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDFRunner.scala:
##
@@ -146,4 +157,30 @@ object PythonUDFRunner {
   }
 }
   }
+
+  def writeUDFs(
+dataOut: DataOutputStream,

Review Comment:
   please indent each of these function arguments by 2 more spaces (for a total 
of +4 spaces each) per the style guide?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42513: [SPARK-44827][PYTHON][TESTS] Fix test when ansi mode enabled

2023-08-22 Thread via GitHub


HyukjinKwon commented on code in PR #42513:
URL: https://github.com/apache/spark/pull/42513#discussion_r1302314334


##
python/pyspark/testing/utils.py:
##
@@ -539,6 +540,9 @@ def compare_vals(val1, val2):
 elif isinstance(val1, float) and isinstance(val2, float):
 if abs(val1 - val2) > (atol + rtol * abs(val2)):
 return False
+elif isinstance(val1, Decimal) and isinstance(val2, Decimal):

Review Comment:
   @asl3 FYI



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42513: [SPARK-44827][PYTHON][TESTS] Fix test when ansi mode enabled

2023-08-22 Thread via GitHub


HyukjinKwon commented on code in PR #42513:
URL: https://github.com/apache/spark/pull/42513#discussion_r1302314216


##
python/pyspark/testing/utils.py:
##
@@ -411,8 +412,8 @@ def assertDataFrameEqual(
 
 Note that schema equality is checked only when `expected` is a DataFrame 
(not a list of Rows).
 
-For DataFrames with float values, assertDataFrame asserts approximate 
equality.
-Two float values a and b are approximately equal if the following equation 
is True:
+For DataFrames with float/Decimal values, assertDataFrame asserts 
approximate equality.

Review Comment:
   ```suggestion
   For DataFrames with float/decimal values, assertDataFrame asserts 
approximate equality.
   ```



##
python/pyspark/testing/utils.py:
##
@@ -411,8 +412,8 @@ def assertDataFrameEqual(
 
 Note that schema equality is checked only when `expected` is a DataFrame 
(not a list of Rows).
 
-For DataFrames with float values, assertDataFrame asserts approximate 
equality.
-Two float values a and b are approximately equal if the following equation 
is True:
+For DataFrames with float/Decimal values, assertDataFrame asserts 
approximate equality.
+Two float/Decimal values a and b are approximately equal if the following 
equation is True:

Review Comment:
   ```suggestion
   Two float/decimal values a and b are approximately equal if the 
following equation is True:
   ```
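   
   A small usage sketch of the behavior being added (tolerances are illustrative and assume the Decimal branch applies the same `atol + rtol * abs(expected)` formula as the float branch):
   
   ```python
   from decimal import Decimal
   
   from pyspark.sql import SparkSession
   from pyspark.testing import assertDataFrameEqual
   
   spark = SparkSession.builder.getOrCreate()
   
   actual = spark.createDataFrame([(Decimal("83.14"),)], ["amount"])
   expected = spark.createDataFrame([(Decimal("83.15"),)], ["amount"])
   
   # Intended to pass under approximate comparison: |83.14 - 83.15| <= 1e-2 + 1e-3 * 83.15
   assertDataFrameEqual(actual, expected, rtol=1e-3, atol=1e-2)
   ```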



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42513: [SPARK-44827][PYTHON][TESTS] Fix test when ansi mode enabled

2023-08-22 Thread via GitHub


HyukjinKwon commented on code in PR #42513:
URL: https://github.com/apache/spark/pull/42513#discussion_r1302314087


##
python/pyspark/sql/functions.py:
##
@@ -7802,8 +7802,20 @@ def to_unix_timestamp(
 
 >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
 >>> df = spark.createDataFrame([("2016-04-08",)], ["e"])
->>> df.select(to_unix_timestamp(df.e).alias('r')).collect()

Review Comment:
   I think we can remove this whole example for now. This is user-facing documentation that shows an example, so we don't necessarily want to show a failing one.
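   
   If an example is kept at all, one that should pass regardless of the ANSI setting is the variant with an explicit format (a sketch; the epoch value assumes the America/Los_Angeles session time zone set earlier in the docstring):
   
   ```python
   >>> from pyspark.sql.functions import to_unix_timestamp, lit
   >>> df = spark.createDataFrame([("2016-04-08",)], ["e"])
   >>> df.select(to_unix_timestamp(df.e, lit("yyyy-MM-dd")).alias('r')).collect()
   [Row(r=1460098800)]
   ```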



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42513: [SPARK-44827][PYTHON][TESTS] Fix test when ansi mode enabled

2023-08-22 Thread via GitHub


HyukjinKwon commented on code in PR #42513:
URL: https://github.com/apache/spark/pull/42513#discussion_r1302313324


##
python/pyspark/sql/dataframe.py:
##
@@ -3912,16 +3912,29 @@ def union(self, other: "DataFrame") -> "DataFrame":
 >>> df2 = spark.createDataFrame([(3, "Charlie"), (4, "Dave")], ["id", 
"name"])
 >>> df1 = df1.withColumn("age", lit(30))
 >>> df2 = df2.withColumn("age", lit(40))
->>> df3 = df1.union(df2)
->>> df3.show()
-+-+---+---+
-| name| id|age|
-+-+---+---+
-|Alice|  1| 30|
-|  Bob|  2| 30|
-|3|Charlie| 40|
-|4|   Dave| 40|
-+-+---+---+
+>>> if spark.conf.get("spark.sql.ansi.enabled") == "false":

Review Comment:
   Can we fix the input types, for example, floats and ints so it can pass 
whether `spark.sql.ansi.enabled` is on or off?
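
    A minimal sketch of that suggestion (the column values are made up; the point is that both DataFrames share the same column order and types, so the positional union behaves identically with ANSI mode on or off):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()

    # Both inputs use ["id", "name"] with matching types before adding "age".
    df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"]).withColumn("age", lit(30))
    df2 = spark.createDataFrame([(3, "Charlie"), (4, "Dave")], ["id", "name"]).withColumn("age", lit(40))
    df1.union(df2).show()
    # +---+-------+---+
    # | id|   name|age|
    # +---+-------+---+
    # |  1|  Alice| 30|
    # |  2|    Bob| 30|
    # |  3|Charlie| 40|
    # |  4|   Dave| 40|
    # +---+-------+---+
    ```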



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #42614: [MINOR][INFRA] Disable o.a.p.h.InternalParquetRecordWriter logs for tests

2023-08-22 Thread via GitHub


HyukjinKwon commented on PR #42614:
URL: https://github.com/apache/spark/pull/42614#issuecomment-1689075541

   yeah, let's file a JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] tianhanhu opened a new pull request, #42618: [SPARK-44919] Avro connector: convert a union of a single primitive type to a StructType

2023-08-22 Thread via GitHub


tianhanhu opened a new pull request, #42618:
URL: https://github.com/apache/spark/pull/42618

   
   
   ### What changes were proposed in this pull request?
   
   
    This PR adds a new behavior, guarded by a new config, that changes how an Avro union of a single primitive type is converted to Spark types.
   
   
   ### Why are the changes needed?
   
    The Spark Avro data source schema converter currently converts a union with a single primitive type to a Spark primitive type instead of a StructType.
    
    For more complex union types that consist of multiple primitive types, however, the schema converter translates them into StructTypes.
   
   For example, 
   ```
   import scala.collection.JavaConverters._
   import org.apache.avro._
   import org.apache.spark.sql.avro._
   
   // ["string", "null"]
    SchemaConverters.toSqlType(
      Schema.createUnion(Seq(
        Schema.create(Schema.Type.STRING),
        Schema.create(Schema.Type.NULL)).asJava)
    ).dataType
   
   // ["string", "int", "null"]
    SchemaConverters.toSqlType(
      Schema.createUnion(Seq(
        Schema.create(Schema.Type.STRING),
        Schema.create(Schema.Type.INT),
        Schema.create(Schema.Type.NULL)).asJava)
    ).dataType
   ```
   The first one would return StringType, the second would return 
StructType(StringType, IntegerType).

    This is undesirable in some cases, as we may want to consistently get a StructType out of an Avro union.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
    Yes, this PR adds a new config `spark.sql.avro.alwaysConvertUnionToStructType` that can be used to define how an Avro union of a single type is converted into Spark types.
   The default behavior stays the same.
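
    A hypothetical usage sketch of the proposed flag (the config name comes from this PR; the data path is made up and the `spark-avro` package is assumed to be on the classpath). With the flag enabled, even a union with a single non-null branch such as `["string", "null"]` would surface as a StructType rather than a bare StringType:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Opt in to the proposed behavior; the default keeps today's conversion.
    spark.conf.set("spark.sql.avro.alwaysConvertUnionToStructType", "true")

    df = spark.read.format("avro").load("/tmp/union_example.avro")
    df.printSchema()
    ```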
   
   ### How was this patch tested?
   
   New unit tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #42603: [SPARK-44907][PYTHON][CONNECT] `DataFrame.join` should throw IllegalArgumentException for invalid join types

2023-08-22 Thread via GitHub


zhengruifeng commented on PR #42603:
URL: https://github.com/apache/spark/pull/42603#issuecomment-1689073965

   thanks! merged to master and branch-3.5
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng closed pull request #42603: [SPARK-44907][PYTHON][CONNECT] `DataFrame.join` should throw IllegalArgumentException for invalid join types

2023-08-22 Thread via GitHub


zhengruifeng closed pull request #42603: [SPARK-44907][PYTHON][CONNECT] 
`DataFrame.join` should throw IllegalArgumentException for invalid join types
URL: https://github.com/apache/spark/pull/42603


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #42578: [SPARK-44841][FOLLOWUP] Add migration guide for the behavior change

2023-08-22 Thread via GitHub


HyukjinKwon commented on PR #42578:
URL: https://github.com/apache/spark/pull/42578#issuecomment-1689072116

   WDYT @itholic ^


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] allisonwang-db commented on a diff in pull request #42272: [SPARK-44508][PYTHON][DOCS] Add user guide for Python user-defined table functions

2023-08-22 Thread via GitHub


allisonwang-db commented on code in PR #42272:
URL: https://github.com/apache/spark/pull/42272#discussion_r1302303933


##
python/docs/source/user_guide/sql/python_udtf.rst:
##
@@ -0,0 +1,222 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+===
+Python User-defined Table Functions (UDTFs)
+===
+
+Spark 3.5 introduces Python user-defined table functions (UDTFs), a new type 
of user-defined function. 
+Unlike scalar functions that return a single result value, a UDTF is invoked 
in the FROM clause and returns 
+an entire relation as output. Each UDTF call can accept zero or more 
arguments. 
+These arguments can be scalar constant expressions or separate input relations.
+
+Implementing a Python UDTF
+--
+
+.. currentmodule:: pyspark.sql.functions
+
+To implement a Python UDTF, you can define a class implementing the methods:
+
+.. code-block:: python
+
+class PythonUDTF:
+
+def __init__(self) -> None:
+"""
+Initialize the user-defined table function (UDTF).
+
+This method serves as the default constructor and is called once 
when the
+UDTF is instantiated on the executor side.
+
+Any class fields assigned in this method will be available for 
subsequent
+calls to the `eval` and `terminate` methods.
+
+Notes
+-
+- This method does not accept any extra arguments.
+- You cannot create or reference the Spark session within the 
UDTF. Any
+  attempt to do so will result in a serialization error.
+"""
+...
+
+def eval(self, *args: Any) -> Iterator[Any]:
+"""
+Evaluate the function using the given input arguments.
+
+This method is required and must be implemented.
+
+The arguments provided to the UDTF call are mapped to the values 
in the
+`*args` list sequentially. Each provided scalar expression maps to 
exactly
+one value in this `*args` list. Each provided TABLE argument of N 
columns

Review Comment:
   Updated!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] allisonwang-db commented on a diff in pull request #42272: [SPARK-44508][PYTHON][DOCS] Add user guide for Python user-defined table functions

2023-08-22 Thread via GitHub


allisonwang-db commented on code in PR #42272:
URL: https://github.com/apache/spark/pull/42272#discussion_r1302303763


##
examples/src/main/python/sql/udtf.py:
##
@@ -0,0 +1,230 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+A simple example demonstrating Python UDTFs in Spark
+Run with:
+  ./bin/spark-submit examples/src/main/python/sql/udtf.py
+"""
+
+# NOTE that this file is imported in the User Guides in PySpark documentation.
+# The codes are referred via line numbers. See also `literalinclude` directive 
in Sphinx.
+from pyspark.sql import SparkSession
+from pyspark.sql.pandas.utils import require_minimum_pandas_version, 
require_minimum_pyarrow_version
+
+# Python UDTFs use Arrow by default.
+require_minimum_pandas_version()
+require_minimum_pyarrow_version()
+
+
+def python_udtf_simple_example(spark: SparkSession) -> None:
+
+# Define the UDTF class and implement the required `eval` method.
+class SquareNumbers:
+def eval(self, start: int, end: int):  # type: ignore[no-untyped-def]
+for num in range(start, end + 1):
+yield (num, num * num)
+
+from pyspark.sql.functions import lit, udtf
+
+# Create a UDTF using the class definition and the `udtf` function.
+square_num = udtf(SquareNumbers, returnType="num: int, squared: int")
+
+# Invoke the UDTF in PySpark.
+square_num(lit(1), lit(3)).show()  # type: ignore
+# +---+--+
+# |num|squred|

Review Comment:
   Good catch! Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] JoshRosen commented on a diff in pull request #42599: [DO-NOT-MERGE] Remove Guava from shared classes from IsolatedClientLoader

2023-08-22 Thread via GitHub


JoshRosen commented on code in PR #42599:
URL: https://github.com/apache/spark/pull/42599#discussion_r1302282764


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala:
##
@@ -130,8 +130,7 @@ private[hive] object IsolatedClientLoader extends Logging {
 }
 val hiveArtifacts = version.extraDeps ++
   Seq("hive-metastore", "hive-exec", "hive-common", "hive-serde")
-.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++
-  Seq("com.google.guava:guava:14.0.1") ++ hadoopJarNames
+.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ 
hadoopJarNames

Review Comment:
    Here, I think the Guava version needs to be a function of `version`, rather than matching Spark's Guava version or omitting Guava: if we remove Guava from the shared classes but also change the now-non-shared metastore Guava version, then it has the same net effect and breaks older Hive versions.
   
   I believe that newer versions of Hive shade Guava (in which case they're 
insensitive to whatever value we set here). I think we could either (a) 
continue to unconditionally use Guava 14.0.1 here, or (b) conditionally use it 
only for older Hive versions that predated the Guava shading.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on a diff in pull request #42194: [SPARK-41471][SQL] Reduce Spark shuffle when only one side of a join is KeyGroupedPartitioning

2023-08-22 Thread via GitHub


sunchao commented on code in PR #42194:
URL: https://github.com/apache/spark/pull/42194#discussion_r1302282211


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala:
##
@@ -182,7 +182,16 @@ case class BatchScanExec(
 
   // Now fill missing partition keys with empty partitions
   val partitionMapping = nestGroupedPartitions.toMap
-  finalPartitions = spjParams.commonPartitionValues.get.flatMap {
+
+  // SPARK-41471: We keep to order of partition keys in 
`commonPartitionValues` to

Review Comment:
    I think the partition values should already be sorted by themselves; it's just that `mergePartitions` in `EnsureRequirements` doesn't maintain the ordering.
   
   Instead of doing it here, can we update 
`InternalRowComparableWrapper.mergePartitions` to make sure the result is 
ordered?
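
    An abstract illustration of the point (plain Python, not the Spark internals): sorting the merged partition values gives a deterministic order instead of relying on whatever order the merge happened to produce.

    ```python
    # Two sides' partition values, each already sorted.
    left = [(1,), (3,), (5,)]
    right = [(2,), (3,), (4,)]

    # Merging and then sorting keeps the combined result deterministically ordered.
    merged = sorted(set(left) | set(right))
    print(merged)  # [(1,), (2,), (3,), (4,), (5,)]
    ```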



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] heyihong commented on a diff in pull request #42377: [SPARK-44622][SQL][CONNECT] Implement error enrichment and setting server-side stacktrace

2023-08-22 Thread via GitHub


heyihong commented on code in PR #42377:
URL: https://github.com/apache/spark/pull/42377#discussion_r1301851158


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2843,6 +2843,21 @@ object SQLConf {
   // show full stacktrace in tests but hide in production by default.
   .createWithDefault(Utils.isTesting)
 
+  val SPARK_ENRICH_ERROR_ENABLED =
+buildConf("spark.sql.spark.enrichError.enabled")
+  .doc("When true, it enriches errors with full exception messages on the 
client side.")
+  .version("3.5.0")

Review Comment:
   `org.apache.spark.sql.connect.config.Connect` cannot be configured by 
`spark.conf.set(...)` if I understand correctly?
   
   Maybe we can name this to `spark.sql.connect.enrichError.enabled`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] heyihong commented on a diff in pull request #42377: [SPARK-44622][SQL][CONNECT] Implement error enrichment and setting server-side stacktrace

2023-08-22 Thread via GitHub


heyihong commented on code in PR #42377:
URL: https://github.com/apache/spark/pull/42377#discussion_r1301851158


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2843,6 +2843,21 @@ object SQLConf {
   // show full stacktrace in tests but hide in production by default.
   .createWithDefault(Utils.isTesting)
 
+  val SPARK_ENRICH_ERROR_ENABLED =
+buildConf("spark.sql.spark.enrichError.enabled")
+  .doc("When true, it enriches errors with full exception messages on the 
client side.")
+  .version("3.5.0")

Review Comment:
   `org.apache.spark.sql.connect.config.Connect` cannot be configured by 
`spark.conf.set(...)` if I understand correctly?
   
   Maybe we can name this to `spark.sql.connect.jvmStacktrace.enabled`



##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -2843,6 +2843,21 @@ object SQLConf {
   // show full stacktrace in tests but hide in production by default.
   .createWithDefault(Utils.isTesting)
 
+  val SPARK_ENRICH_ERROR_ENABLED =
+buildConf("spark.sql.spark.enrichError.enabled")
+  .doc("When true, it enriches errors with full exception messages on the 
client side.")
+  .version("3.5.0")

Review Comment:
   `org.apache.spark.sql.connect.config.Connect` cannot be configured by 
`spark.conf.set(...)` if I understand correctly?
   
   Maybe we can name this to `spark.sql.spark.enrichError.enabled`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on pull request #42617: [SPARK-44918][SQL][PYTHON] Support named arguments in scalar Python/Pandas UDFs

2023-08-22 Thread via GitHub


ueshin commented on PR #42617:
URL: https://github.com/apache/spark/pull/42617#issuecomment-1689030291

   cc @dtenedor @HyukjinKwon @xinrong-meng 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dtenedor commented on pull request #42595: [SPARK-44901][SQL] Add API in Python UDTF 'analyze' method to return partitioning/ordering expressions

2023-08-22 Thread via GitHub


dtenedor commented on PR #42595:
URL: https://github.com/apache/spark/pull/42595#issuecomment-1689028968

   cc @allisonwang-db @ueshin @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zeruibao commented on a diff in pull request #42503: [SPARK-43380][SQL] Fix Avro data type conversion issues without causing performance regression

2023-08-22 Thread via GitHub


zeruibao commented on code in PR #42503:
URL: https://github.com/apache/spark/pull/42503#discussion_r1302261323


##
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala:
##
@@ -128,6 +133,36 @@ private[sql] class AvroDeserializer(
   case (INT, IntegerType) => (updater, ordinal, value) =>
 updater.setInt(ordinal, value.asInstanceOf[Int])
 
+  case (LONG, dt: TimestampType)

Review Comment:
   oh, that's a good idea!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on a diff in pull request #42503: [SPARK-43380][SQL] Fix Avro data type conversion issues without causing performance regression

2023-08-22 Thread via GitHub


gengliangwang commented on code in PR #42503:
URL: https://github.com/apache/spark/pull/42503#discussion_r1302257921


##
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala:
##
@@ -128,6 +133,36 @@ private[sql] class AvroDeserializer(
   case (INT, IntegerType) => (updater, ordinal, value) =>
 updater.setInt(ordinal, value.asInstanceOf[Int])
 
+  case (LONG, dt: TimestampType)

Review Comment:
    TimestampType/TimestampNTZType/DateType are all subtypes of DatetimeType. We can reduce the duplicated code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zeruibao commented on a diff in pull request #42503: [SPARK-43380][SQL] Fix Avro data type conversion issues without causing performance regression

2023-08-22 Thread via GitHub


zeruibao commented on code in PR #42503:
URL: https://github.com/apache/spark/pull/42503#discussion_r1302256752


##
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala:
##
@@ -128,6 +133,36 @@ private[sql] class AvroDeserializer(
   case (INT, IntegerType) => (updater, ordinal, value) =>
 updater.setInt(ordinal, value.asInstanceOf[Int])
 
+  case (LONG, dt: TimestampType)

Review Comment:
    Is `DatetimeType` the same as `TimestampType`? Why is using `DatetimeType` better?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin opened a new pull request, #42617: [SPARK-44918][SQL][PYTHON] Support named arguments in scalar Python/Pandas UDFs

2023-08-22 Thread via GitHub


ueshin opened a new pull request, #42617:
URL: https://github.com/apache/spark/pull/42617

   ### What changes were proposed in this pull request?
   
    Supports named arguments in scalar Python/Pandas UDFs.
   
   For example:
   
   ```py
   >>> @udf("int")
   ... def test_udf(a, b):
    ...     return a + 10 * b
   ...
   >>> spark.udf.register("test_udf", test_udf)
   
   >>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show()
    +---------------------------------+
    |test_udf(b => (id * 10), a => id)|
    +---------------------------------+
    |                                0|
    |                              101|
    +---------------------------------+
   
   >>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show()
    +---------------------------------+
    |test_udf(b => (id * 10), a => id)|
    +---------------------------------+
    |                                0|
    |                              101|
    +---------------------------------+
   ```
   
   or:
   
   ```py
   >>> @pandas_udf("int")
   ... def test_udf(a, b):
    ...     return a + 10 * b
   ...
   >>> spark.udf.register("test_udf", test_udf)
   
   >>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show()
    +---------------------------------+
    |test_udf(b => (id * 10), a => id)|
    +---------------------------------+
    |                                0|
    |                              101|
    +---------------------------------+
   
   >>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show()
    +---------------------------------+
    |test_udf(b => (id * 10), a => id)|
    +---------------------------------+
    |                                0|
    |                              101|
    +---------------------------------+
   ```
   
   ### Why are the changes needed?
   
    Now that named arguments support has been added (https://github.com/apache/spark/pull/41796, https://github.com/apache/spark/pull/42020), scalar Python/Pandas UDFs can support it as well.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, named arguments will be available for scalar Python/Pandas UDFs.
   
   ### How was this patch tested?
   
   Added related tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #42615: [SPARK-44916][DOCS][TESTS] Document Spark Driver Live Log UI

2023-08-22 Thread via GitHub


dongjoon-hyun commented on PR #42615:
URL: https://github.com/apache/spark/pull/42615#issuecomment-1688930713

   Merged to master for Apache Spark 4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun closed pull request #42615: [SPARK-44916][DOCS][TESTS] Document Spark Driver Live Log UI

2023-08-22 Thread via GitHub


dongjoon-hyun closed pull request #42615: [SPARK-44916][DOCS][TESTS] Document 
Spark Driver Live Log UI
URL: https://github.com/apache/spark/pull/42615


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #42615: [SPARK-44916][DOCS][TESTS] Document Spark Driver Live Log UI

2023-08-22 Thread via GitHub


dongjoon-hyun commented on PR #42615:
URL: https://github.com/apache/spark/pull/42615#issuecomment-1688925615

   Thank you, @viirya !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #42615: [SPARK-44916][DOCS][TESTS] Document Spark Driver Live Log UI

2023-08-22 Thread via GitHub


dongjoon-hyun commented on PR #42615:
URL: https://github.com/apache/spark/pull/42615#issuecomment-1688915353

   Could you review this PR, @viirya ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default

2023-08-22 Thread via GitHub


dongjoon-hyun commented on PR #40390:
URL: https://github.com/apache/spark/pull/40390#issuecomment-1688904895

   Thank you, @ulysses-you and @cloud-fan .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


