[GitHub] [spark] yaooqinn commented on pull request #42620: [SPARK-44921][SQL] Remove SqlBaseLexer.tokens from codebase
yaooqinn commented on PR #42620: URL: https://github.com/apache/spark/pull/42620#issuecomment-1689327389 thanks, merged to master and 3.5.0 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn closed pull request #42620: [SPARK-44921][SQL] Remove SqlBaseLexer.tokens from codebase
yaooqinn closed pull request #42620: [SPARK-44921][SQL] Remove SqlBaseLexer.tokens from codebase URL: https://github.com/apache/spark/pull/42620 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on pull request #42619: [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()
yaooqinn commented on PR #42619: URL: https://github.com/apache/spark/pull/42619#issuecomment-1689325031 Thanks @JoshRosen @cloud-fan @HyukjinKwon and @dongjoon-hyun Merged to master, '3.5.0', '3.4.2', and '3.3.4' -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn closed pull request #42619: [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()
yaooqinn closed pull request #42619: [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient() URL: https://github.com/apache/spark/pull/42619 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #42608: [SPARK-42017][PYTHON][CONNECT][TESTS] Enable `ColumnParityTests.test_access_column`
zhengruifeng commented on code in PR #42608: URL: https://github.com/apache/spark/pull/42608#discussion_r1302514076 ## python/pyspark/sql/connect/dataframe.py: ## @@ -1623,6 +1624,10 @@ def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Colum alias = self._get_alias() if self._plan is None: raise SparkConnectException("Cannot analyze on empty plan.") + +if "." not in item and "*" not in item and "`" not in item and item not in self.columns: Review Comment: good idea! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #42624: [SPARK-44925][K8S] K8s default service token file should not be materialized into token
dongjoon-hyun commented on PR #42624: URL: https://github.com/apache/spark/pull/42624#issuecomment-1689321845 Merged to master/3.5/3.4/3.3. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #42591: [SPARK-44784][CONNECT] Make SBT testing hermetic.
LuciferYang commented on code in PR #42591: URL: https://github.com/apache/spark/pull/42591#discussion_r1302509829 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/test/RemoteSparkSession.scala: ## @@ -0,0 +1,241 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.test + +import java.io.{File, IOException, OutputStream} +import java.lang.ProcessBuilder +import java.lang.ProcessBuilder.Redirect +import java.nio.file.Paths +import java.util.concurrent.TimeUnit + +import scala.concurrent.duration.FiniteDuration + +import org.apache.commons.lang3.{JavaVersion, SystemUtils} +import org.scalactic.source.Position +import org.scalatest.{BeforeAndAfterAll, Tag} + +import org.apache.spark.SparkBuildInfo +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.connect.client.GrpcRetryHandler.RetryPolicy +import org.apache.spark.sql.connect.client.SparkConnectClient +import org.apache.spark.sql.connect.common.config.ConnectCommon +import org.apache.spark.sql.test.IntegrationTestUtils._ + +/** + * An util class to start a local spark connect server in a different process for local E2E tests. + * Pre-running the tests, the spark connect artifact needs to be built using e.g. `build/sbt + * package`. It is designed to start the server once but shared by all tests. It is equivalent to + * use the following command to start the connect server via command line: + * + * {{{ + * bin/spark-shell \ + * --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar | paste -sd ',' -` \ + * --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin + * }}} + * + * Set system property `spark.test.home` or env variable `SPARK_HOME` if the test is not executed + * from the Spark project top folder. Set system property `spark.debug.sc.jvm.client=true` or + * environment variable `SPARK_DEBUG_SC_JVM_CLIENT=true` to print the server process output in the + * console to debug server start stop problems. + */ +object SparkConnectServerUtils { + + // Server port + val port: Int = +ConnectCommon.CONNECT_GRPC_BINDING_PORT + util.Random.nextInt(1000) + + @volatile private var stopped = false + + private var consoleOut: OutputStream = _ + private val serverStopCommand = "q" + + private lazy val sparkConnect: java.lang.Process = { +debug("Starting the Spark Connect Server...") +val connectJar = findJar( + "connector/connect/server", + "spark-connect-assembly", + "spark-connect").getCanonicalPath + +val command = Seq.newBuilder[String] +command += "bin/spark-submit" +command += "--driver-class-path" += connectJar +command += "--class" += "org.apache.spark.sql.connect.SimpleSparkConnectService" +command += "--conf" += s"spark.connect.grpc.binding.port=$port" +command ++= testConfigs +command ++= debugConfigs +command += connectJar +val builder = new ProcessBuilder(command.result(): _*) +builder.directory(new File(sparkHome)) +val environment = builder.environment() +environment.remove("SPARK_DIST_CLASSPATH") +if (isDebug) { + builder.redirectError(Redirect.INHERIT) + builder.redirectOutput(Redirect.INHERIT) +} + +val process = builder.start() +consoleOut = process.getOutputStream + +// Adding JVM shutdown hook +sys.addShutdownHook(stop()) +process + } + + /** + * As one shared spark will be started for all E2E tests, for tests that needs some special + * configs, we add them here + */ + private def testConfigs: Seq[String] = { +// To find InMemoryTableCatalog for V2 writer tests +val catalystTestJar = + tryFindJar("sql/catalyst", "spark-catalyst", "spark-catalyst", test = true) Review Comment: Yes, it will build. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail:
[GitHub] [spark] yaooqinn commented on pull request #42614: [SPARK-44922][TESTS] Disable o.a.p.h.InternalParquetRecordWriter logs for tests
yaooqinn commented on PR #42614: URL: https://github.com/apache/spark/pull/42614#issuecomment-1689317257 Test failures in RocksDBStateStore are frequently occurring and irrelevant. thanks @dongjoon-hyun @HyukjinKwon Merged to master/3.5/3.4. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #42624: [SPARK-44925][K8S] K8s default service token file should not be materialized into token
dongjoon-hyun closed pull request #42624: [SPARK-44925][K8S] K8s default service token file should not be materialized into token URL: https://github.com/apache/spark/pull/42624 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zwangsheng commented on a diff in pull request #42600: [SPARK-44906][K8S] Make `Kubernetes[Driver|Executor]Conf.annotations` substitute annotations instead of feature steps
zwangsheng commented on code in PR #42600: URL: https://github.com/apache/spark/pull/42600#discussion_r1302507782 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala: ## @@ -117,6 +117,7 @@ private[spark] class KubernetesDriverConf( override def annotations: Map[String, String] = { KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, KUBERNETES_DRIVER_ANNOTATION_PREFIX) + .map(p => (p._1, Utils.substituteAppNExecIds(p._2, appId, ""))) Review Comment: Sorry, i misunderstood `the consistent short style`. Does the following code conform ``` .map{ case(k, v) => (k, Utils.substituteAppNExecIds(v, appId, executorId)) } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn closed pull request #42614: [SPARK-44922][TESTS] Disable o.a.p.h.InternalParquetRecordWriter logs for tests
yaooqinn closed pull request #42614: [SPARK-44922][TESTS] Disable o.a.p.h.InternalParquetRecordWriter logs for tests URL: https://github.com/apache/spark/pull/42614 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`
LuciferYang commented on code in PR #42596: URL: https://github.com/apache/spark/pull/42596#discussion_r1302455703 ## python/pyspark/sql/functions.py: ## @@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col @try_remote_functions def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column: -"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count -of column `col`. +""" +Applies an aggregate function to return an approximate distinct count of the specified column. -.. versionadded:: 2.1.0 +This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct +elements in a column or a group of columns. -.. versionchanged:: 3.4.0 -Supports Spark Connect. +.. versionadded:: 2.1.0 .. versionchanged:: 3.4.0 Supports Spark Connect. Parameters -- col : :class:`~pyspark.sql.Column` or str +The label of the column to count distinct values in. rsd : float, optional -maximum relative standard deviation allowed (default = 0.05). -For rsd < 0.01, it is more efficient to use :func:`count_distinct` +The maximum allowed relative standard deviation (default = 0.05). +If rsd < 0.01, it would be more efficient to use :func:`count_distinct`. Returns --- :class:`~pyspark.sql.Column` -the column of computed results. +A new Column object representing the approximate unique count. + +See Also +-- +:meth:`pyspark.sql.functions.count_distinct` Examples +Example 1: Counting distinct values in a single column DataFrame representing integers + +>>> from pyspark.sql.functions import approx_count_distinct >>> df = spark.createDataFrame([1,2,2,3], "INT") >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show() +---+ |distinct_values| +---+ | 3| +---+ + +Example 2: Counting distinct values in a single column DataFrame representing strings + +>>> from pyspark.sql.functions import approx_count_distinct +>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit") +>>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show() ++---+ +|distinct_fruits| ++---+ +| 3| ++---+ + +Example 3: Counting distinct values in a DataFrame with multiple columns + +>>> from pyspark.sql.functions import approx_count_distinct, struct +>>> df = spark.createDataFrame([("Alice", 1), +... ("Alice", 2), +... ("Bob", 3), +... ("Bob", 3)], ["name", "value"]) +>>> df = df.withColumn("combined", struct("name", "value")) +>>> df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show() ++--+ +|distinct_pairs| ++--+ +| 3| ++--+ + +Example 4: Counting distinct values with a specified relative standard deviation + +>>> from pyspark.sql.functions import approx_count_distinct +>>> df = spark.range(10) +>>> df.agg(approx_count_distinct("id", 0.1).alias('distinct_values')).show() Review Comment: Change to ``` >>> df.agg(approx_count_distinct("id").alias('with_default_rsd'), ...approx_count_distinct("id", 0.1).alias('with_rsd_0.1')).show() +++ |with_default_rsd|with_rsd_0.1| +++ | 95546| 102065| +++ ``` when I am not sure if this is the `compare ` you mentioned. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42548: [WIP][SPARK-44750][PYTHON][CONNECT] Apply configuration to sparksession during creation
HyukjinKwon commented on code in PR #42548: URL: https://github.com/apache/spark/pull/42548#discussion_r1302478884 ## python/pyspark/sql/connect/session.py: ## @@ -176,6 +180,27 @@ def enableHiveSupport(self) -> "SparkSession.Builder": error_class="NOT_IMPLEMENTED", message_parameters={"feature": "enableHiveSupport"} ) +def _apply_options(self, session: "SparkSession") -> None: Review Comment: BTW, we will have to fix `pyspark.sql.session.SparkSession.Builder.getOrCreate` too to pass the configurations properly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #42616: [SPARK-44840][SQL][3.5] Make `array_insert()` 1-based for negative indexes
dongjoon-hyun closed pull request #42616: [SPARK-44840][SQL][3.5] Make `array_insert()` 1-based for negative indexes URL: https://github.com/apache/spark/pull/42616 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #42616: [SPARK-44840][SQL][3.5] Make `array_insert()` 1-based for negative indexes
dongjoon-hyun commented on PR #42616: URL: https://github.com/apache/spark/pull/42616#issuecomment-1689272709 Merged to branch-3.5. Thank you, @MaxGekk and all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #42619: [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()
dongjoon-hyun commented on PR #42619: URL: https://github.com/apache/spark/pull/42619#issuecomment-1689271967 Could you re-trigger the failed test cases although it looks irrelevant, @JoshRosen ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #42600: [SPARK-44906][K8S] Make `Kubernetes[Driver|Executor]Conf.annotations` substitute annotations instead of feature steps
dongjoon-hyun commented on PR #42600: URL: https://github.com/apache/spark/pull/42600#issuecomment-1689271048 Thank you for update, @zwangsheng . It seems that there is a minor miscommunication in the above. I added two comments. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #42600: [SPARK-44906][K8S] Make `Kubernetes[Driver|Executor]Conf.annotations` substitute annotations instead of feature steps
dongjoon-hyun commented on code in PR #42600: URL: https://github.com/apache/spark/pull/42600#discussion_r1302473735 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala: ## @@ -117,6 +117,7 @@ private[spark] class KubernetesDriverConf( override def annotations: Map[String, String] = { KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, KUBERNETES_DRIVER_ANNOTATION_PREFIX) + .map(p => (p._1, Utils.substituteAppNExecIds(p._2, appId, ""))) Review Comment: Sorry, but what I suggested (https://github.com/apache/spark/pull/42600#discussion_r1302137555) is the following style instead of `p => ...`. ```scala case (k, v) => ... ``` ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala: ## @@ -188,6 +189,7 @@ private[spark] class KubernetesExecutorConf( override def annotations: Map[String, String] = { KubernetesUtils.parsePrefixedKeyValuePairs(sparkConf, KUBERNETES_EXECUTOR_ANNOTATION_PREFIX) + .map(p => (p._1, Utils.substituteAppNExecIds(p._2, appId, executorId))) Review Comment: ditto. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #42531: [SPARK-44846][SQL] Pull out complex grouping expressions after remove redundant aggregates
cloud-fan commented on code in PR #42531: URL: https://github.com/apache/spark/pull/42531#discussion_r1302472237 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregates.scala: ## @@ -42,11 +42,12 @@ object RemoveRedundantAggregates extends Rule[LogicalPlan] with AliasHelper { ) // We might have introduces non-deterministic grouping expression Review Comment: I have a different idea. I think it's risky to put the new grouping expressions back to the `Aggregate`, as the grouping expression contains things in the SELECT list. This is a long-standing issue and I feel it's better to just create a `Project` below the `Aggregate` to calculate grouping expressions, and other optimizer rules can merge/eliminate this extra `Project` if it only contains `Attributes`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn opened a new pull request, #42625: [SPARK-44802][INFRA][FOLLOWUP] Eagerly check if the token is valid to align with the behavior of username/password authn
yaooqinn opened a new pull request, #42625: URL: https://github.com/apache/spark/pull/42625 ### What changes were proposed in this pull request? The SPARK-44802 now allows for token authentication when resolving Jira issues in pull request merging. However, the token auth is kinda lazy during the initial handshake, maintainers might get confused someday. This pull request promptly calls the current_user() function to initiate authentication and provides clear instructions for token expiration. ### Why are the changes needed? make it easy for maintainers to update their expired Jira tokens. ### Does this PR introduce _any_ user-facing change? no, for maintainers ### How was this patch tested? locally verified the code snippet ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR closed pull request #42567: [SPARK-44878][SS] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full
HeartSaVioR closed pull request #42567: [SPARK-44878][SS] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full URL: https://github.com/apache/spark/pull/42567 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #42567: [SPARK-44878][SS] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full
HeartSaVioR commented on PR #42567: URL: https://github.com/apache/spark/pull/42567#issuecomment-1689252968 Thanks! Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42548: [WIP][SPARK-44750][PYTHON][CONNECT] Apply configuration to sparksession during creation
HyukjinKwon commented on code in PR #42548: URL: https://github.com/apache/spark/pull/42548#discussion_r1302457534 ## python/pyspark/sql/connect/session.py: ## @@ -176,6 +180,27 @@ def enableHiveSupport(self) -> "SparkSession.Builder": error_class="NOT_IMPLEMENTED", message_parameters={"feature": "enableHiveSupport"} ) +def _apply_options(self, session: "SparkSession") -> None: +with self._lock: +for k, v in self._options.items(): +try: +session.conf.set(k, v) +except AnalysisException as e: +current = currentframe() Review Comment: What are you trying to do here BTW? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42548: [WIP][SPARK-44750][PYTHON][CONNECT] Apply configuration to sparksession during creation
HyukjinKwon commented on code in PR #42548: URL: https://github.com/apache/spark/pull/42548#discussion_r1302457329 ## python/pyspark/sql/connect/session.py: ## @@ -176,6 +180,27 @@ def enableHiveSupport(self) -> "SparkSession.Builder": error_class="NOT_IMPLEMENTED", message_parameters={"feature": "enableHiveSupport"} ) +def _apply_options(self, session: "SparkSession") -> None: +with self._lock: +for k, v in self._options.items(): +try: +session.conf.set(k, v) +except AnalysisException as e: +current = currentframe() +lineno = getframeinfo(current).lineno + 1 if current is not None else 0 +print( Review Comment: This outer `print` won't be needed. You could directly use `warnings.warn` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`
LuciferYang commented on code in PR #42596: URL: https://github.com/apache/spark/pull/42596#discussion_r1302445423 ## python/pyspark/sql/functions.py: ## @@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col @try_remote_functions def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column: -"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count -of column `col`. +""" +Applies an aggregate function to return an approximate distinct count of the specified column. -.. versionadded:: 2.1.0 +This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct +elements in a column or a group of columns. -.. versionchanged:: 3.4.0 -Supports Spark Connect. +.. versionadded:: 2.1.0 .. versionchanged:: 3.4.0 Supports Spark Connect. Parameters -- col : :class:`~pyspark.sql.Column` or str +The label of the column to count distinct values in. rsd : float, optional -maximum relative standard deviation allowed (default = 0.05). -For rsd < 0.01, it is more efficient to use :func:`count_distinct` +The maximum allowed relative standard deviation (default = 0.05). +If rsd < 0.01, it would be more efficient to use :func:`count_distinct`. Returns --- :class:`~pyspark.sql.Column` -the column of computed results. +A new Column object representing the approximate unique count. + +See Also +-- +:meth:`pyspark.sql.functions.count_distinct` Examples +Example 1: Counting distinct values in a single column DataFrame representing integers + +>>> from pyspark.sql.functions import approx_count_distinct >>> df = spark.createDataFrame([1,2,2,3], "INT") >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show() +---+ |distinct_values| +---+ | 3| +---+ + +Example 2: Counting distinct values in a single column DataFrame representing strings + +>>> from pyspark.sql.functions import approx_count_distinct +>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit") Review Comment: change to `df = spark.createDataFrame([("apple",), ("orange",), ("apple",), ("banana",)], ['fruit'])`, Is that so? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`
LuciferYang commented on code in PR #42596: URL: https://github.com/apache/spark/pull/42596#discussion_r1302445423 ## python/pyspark/sql/functions.py: ## @@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col @try_remote_functions def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column: -"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count -of column `col`. +""" +Applies an aggregate function to return an approximate distinct count of the specified column. -.. versionadded:: 2.1.0 +This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct +elements in a column or a group of columns. -.. versionchanged:: 3.4.0 -Supports Spark Connect. +.. versionadded:: 2.1.0 .. versionchanged:: 3.4.0 Supports Spark Connect. Parameters -- col : :class:`~pyspark.sql.Column` or str +The label of the column to count distinct values in. rsd : float, optional -maximum relative standard deviation allowed (default = 0.05). -For rsd < 0.01, it is more efficient to use :func:`count_distinct` +The maximum allowed relative standard deviation (default = 0.05). +If rsd < 0.01, it would be more efficient to use :func:`count_distinct`. Returns --- :class:`~pyspark.sql.Column` -the column of computed results. +A new Column object representing the approximate unique count. + +See Also +-- +:meth:`pyspark.sql.functions.count_distinct` Examples +Example 1: Counting distinct values in a single column DataFrame representing integers + +>>> from pyspark.sql.functions import approx_count_distinct >>> df = spark.createDataFrame([1,2,2,3], "INT") >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show() +---+ |distinct_values| +---+ | 3| +---+ + +Example 2: Counting distinct values in a single column DataFrame representing strings + +>>> from pyspark.sql.functions import approx_count_distinct +>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit") Review Comment: change to `spark.createDataFrame([("apple"), ("orange"), ("apple"), ("banana")], ['fruit'])`, Is that so? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`
LuciferYang commented on code in PR #42596: URL: https://github.com/apache/spark/pull/42596#discussion_r1302455703 ## python/pyspark/sql/functions.py: ## @@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col @try_remote_functions def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column: -"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count -of column `col`. +""" +Applies an aggregate function to return an approximate distinct count of the specified column. -.. versionadded:: 2.1.0 +This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct +elements in a column or a group of columns. -.. versionchanged:: 3.4.0 -Supports Spark Connect. +.. versionadded:: 2.1.0 .. versionchanged:: 3.4.0 Supports Spark Connect. Parameters -- col : :class:`~pyspark.sql.Column` or str +The label of the column to count distinct values in. rsd : float, optional -maximum relative standard deviation allowed (default = 0.05). -For rsd < 0.01, it is more efficient to use :func:`count_distinct` +The maximum allowed relative standard deviation (default = 0.05). +If rsd < 0.01, it would be more efficient to use :func:`count_distinct`. Returns --- :class:`~pyspark.sql.Column` -the column of computed results. +A new Column object representing the approximate unique count. + +See Also +-- +:meth:`pyspark.sql.functions.count_distinct` Examples +Example 1: Counting distinct values in a single column DataFrame representing integers + +>>> from pyspark.sql.functions import approx_count_distinct >>> df = spark.createDataFrame([1,2,2,3], "INT") >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show() +---+ |distinct_values| +---+ | 3| +---+ + +Example 2: Counting distinct values in a single column DataFrame representing strings + +>>> from pyspark.sql.functions import approx_count_distinct +>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit") +>>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show() ++---+ +|distinct_fruits| ++---+ +| 3| ++---+ + +Example 3: Counting distinct values in a DataFrame with multiple columns + +>>> from pyspark.sql.functions import approx_count_distinct, struct +>>> df = spark.createDataFrame([("Alice", 1), +... ("Alice", 2), +... ("Bob", 3), +... ("Bob", 3)], ["name", "value"]) +>>> df = df.withColumn("combined", struct("name", "value")) +>>> df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show() ++--+ +|distinct_pairs| ++--+ +| 3| ++--+ + +Example 4: Counting distinct values with a specified relative standard deviation + +>>> from pyspark.sql.functions import approx_count_distinct +>>> df = spark.range(10) +>>> df.agg(approx_count_distinct("id", 0.1).alias('distinct_values')).show() Review Comment: Change to xx ``` >>> df.agg(approx_count_distinct("id").alias('with_default_rsd'), ...approx_count_distinct("id", 0.1).alias('with_rsd_0.1')).show() +++ |with_default_rsd|with_rsd_0.1| +++ | 95546| 102065| +++ ``` when I am not sure if this is the `compare ` you mentioned. ## python/pyspark/sql/functions.py: ## @@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col @try_remote_functions def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column: -"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count -of column `col`. +""" +Applies an aggregate function to return an approximate distinct count of the specified column. -.. versionadded:: 2.1.0 +This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct +elements in a column or a group of columns. Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at:
[GitHub] [spark] beliefer commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method
beliefer commented on code in PR #42612: URL: https://github.com/apache/spark/pull/42612#discussion_r1302453549 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala: ## @@ -279,7 +282,8 @@ case class StaticInvoke( inputTypes: Seq[AbstractDataType] = Nil, propagateNull: Boolean = true, returnNullable: Boolean = true, -isDeterministic: Boolean = true) extends InvokeLike { +isDeterministic: Boolean = true, +scalarFunction: Option[ScalarFunction[_]] = None) extends InvokeLike { Review Comment: Please refer https://github.com/apache/spark/pull/42612#discussion_r1302453275 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method
beliefer commented on code in PR #42612: URL: https://github.com/apache/spark/pull/42612#discussion_r1302453275 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala: ## @@ -159,7 +159,7 @@ object V2ExpressionUtils extends SQLConfHelper with Logging { StaticInvoke(scalarFunc.getClass, scalarFunc.resultType(), MAGIC_METHOD_NAME, arguments, inputTypes = declaredInputTypes, propagateNull = false, returnNullable = scalarFunc.isResultNullable, - isDeterministic = scalarFunc.isDeterministic) + isDeterministic = scalarFunc.isDeterministic, scalarFunction = Some(scalarFunc)) Review Comment: Because spark already construct the StaticInvoke by calling these methods. e.g. `scalarFunc.getClass`, `scalarFunc.resultType()` and so on. Please keep the same way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42608: [SPARK-42017][PYTHON][CONNECT][TESTS] Enable `ColumnParityTests.test_access_column`
HyukjinKwon commented on code in PR #42608: URL: https://github.com/apache/spark/pull/42608#discussion_r1302449910 ## python/pyspark/sql/connect/dataframe.py: ## @@ -1623,6 +1624,10 @@ def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Colum alias = self._get_alias() if self._plan is None: raise SparkConnectException("Cannot analyze on empty plan.") + +if "." not in item and "*" not in item and "`" not in item and item not in self.columns: Review Comment: Should we maybe just: ```python col = _to_col_with_plan_id( col=alias if alias is not None else item, plan_id=self._plan._plan_id, ) self.select(col) return col ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ConeyLiu commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method
ConeyLiu commented on code in PR #42612: URL: https://github.com/apache/spark/pull/42612#discussion_r1302448847 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala: ## @@ -279,7 +282,8 @@ case class StaticInvoke( inputTypes: Seq[AbstractDataType] = Nil, propagateNull: Boolean = true, returnNullable: Boolean = true, -isDeterministic: Boolean = true) extends InvokeLike { +isDeterministic: Boolean = true, +scalarFunction: Option[ScalarFunction[_]] = None) extends InvokeLike { Review Comment: We need both `name` and `canonicalName` to build `UserDefinedScalarFunc`. So we need to add two extra parameters. Wouldn't it be more intuitive to use `ScalarFunction`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`
LuciferYang commented on code in PR #42596: URL: https://github.com/apache/spark/pull/42596#discussion_r1302446464 ## python/pyspark/sql/functions.py: ## @@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col @try_remote_functions def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column: -"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count -of column `col`. +""" +Applies an aggregate function to return an approximate distinct count of the specified column. -.. versionadded:: 2.1.0 +This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct +elements in a column or a group of columns. -.. versionchanged:: 3.4.0 -Supports Spark Connect. +.. versionadded:: 2.1.0 .. versionchanged:: 3.4.0 Supports Spark Connect. Parameters -- col : :class:`~pyspark.sql.Column` or str +The label of the column to count distinct values in. rsd : float, optional -maximum relative standard deviation allowed (default = 0.05). -For rsd < 0.01, it is more efficient to use :func:`count_distinct` +The maximum allowed relative standard deviation (default = 0.05). +If rsd < 0.01, it would be more efficient to use :func:`count_distinct`. Returns --- :class:`~pyspark.sql.Column` -the column of computed results. +A new Column object representing the approximate unique count. + +See Also +-- +:meth:`pyspark.sql.functions.count_distinct` Examples +Example 1: Counting distinct values in a single column DataFrame representing integers + +>>> from pyspark.sql.functions import approx_count_distinct >>> df = spark.createDataFrame([1,2,2,3], "INT") >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show() +---+ |distinct_values| +---+ | 3| +---+ + +Example 2: Counting distinct values in a single column DataFrame representing strings + +>>> from pyspark.sql.functions import approx_count_distinct +>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit") +>>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show() ++---+ +|distinct_fruits| ++---+ +| 3| ++---+ + +Example 3: Counting distinct values in a DataFrame with multiple columns + +>>> from pyspark.sql.functions import approx_count_distinct, struct +>>> df = spark.createDataFrame([("Alice", 1), +... ("Alice", 2), +... ("Bob", 3), +... ("Bob", 3)], ["name", "value"]) +>>> df = df.withColumn("combined", struct("name", "value")) +>>> df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show() Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`
LuciferYang commented on code in PR #42596: URL: https://github.com/apache/spark/pull/42596#discussion_r1302445841 ## python/pyspark/sql/functions.py: ## @@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col @try_remote_functions def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column: -"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count -of column `col`. +""" +Applies an aggregate function to return an approximate distinct count of the specified column. -.. versionadded:: 2.1.0 +This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct +elements in a column or a group of columns. -.. versionchanged:: 3.4.0 -Supports Spark Connect. +.. versionadded:: 2.1.0 .. versionchanged:: 3.4.0 Supports Spark Connect. Parameters -- col : :class:`~pyspark.sql.Column` or str +The label of the column to count distinct values in. rsd : float, optional -maximum relative standard deviation allowed (default = 0.05). -For rsd < 0.01, it is more efficient to use :func:`count_distinct` +The maximum allowed relative standard deviation (default = 0.05). +If rsd < 0.01, it would be more efficient to use :func:`count_distinct`. Returns --- :class:`~pyspark.sql.Column` -the column of computed results. +A new Column object representing the approximate unique count. + +See Also +-- +:meth:`pyspark.sql.functions.count_distinct` Examples +Example 1: Counting distinct values in a single column DataFrame representing integers + +>>> from pyspark.sql.functions import approx_count_distinct >>> df = spark.createDataFrame([1,2,2,3], "INT") Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #42596: [SPARK-44903][PYTHON][DOCS] Refine docstring of `approx_count_distinct`
LuciferYang commented on code in PR #42596: URL: https://github.com/apache/spark/pull/42596#discussion_r1302445423 ## python/pyspark/sql/functions.py: ## @@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col @try_remote_functions def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column: -"""Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count -of column `col`. +""" +Applies an aggregate function to return an approximate distinct count of the specified column. -.. versionadded:: 2.1.0 +This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct +elements in a column or a group of columns. -.. versionchanged:: 3.4.0 -Supports Spark Connect. +.. versionadded:: 2.1.0 .. versionchanged:: 3.4.0 Supports Spark Connect. Parameters -- col : :class:`~pyspark.sql.Column` or str +The label of the column to count distinct values in. rsd : float, optional -maximum relative standard deviation allowed (default = 0.05). -For rsd < 0.01, it is more efficient to use :func:`count_distinct` +The maximum allowed relative standard deviation (default = 0.05). +If rsd < 0.01, it would be more efficient to use :func:`count_distinct`. Returns --- :class:`~pyspark.sql.Column` -the column of computed results. +A new Column object representing the approximate unique count. + +See Also +-- +:meth:`pyspark.sql.functions.count_distinct` Examples +Example 1: Counting distinct values in a single column DataFrame representing integers + +>>> from pyspark.sql.functions import approx_count_distinct >>> df = spark.createDataFrame([1,2,2,3], "INT") >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show() +---+ |distinct_values| +---+ | 3| +---+ + +Example 2: Counting distinct values in a single column DataFrame representing strings + +>>> from pyspark.sql.functions import approx_count_distinct +>>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit") Review Comment: like `spark.createDataFrame([("apple"), ("orange"), ("apple"), ("banana")], ['fruit'])`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #42624: [SPARK-44925][K8S] K8s default service token file should not be materialized into token
dongjoon-hyun commented on PR #42624: URL: https://github.com/apache/spark/pull/42624#issuecomment-1689229669 Thank you so much, @yaooqinn ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on a diff in pull request #42549: [SPARK-44860][SQL] Add SESSION_USER function
yaooqinn commented on code in PR #42549: URL: https://github.com/apache/spark/pull/42549#discussion_r1302442801 ## sql/core/src/test/scala/org/apache/spark/sql/MiscFunctionsSuite.scala: ## @@ -48,15 +48,16 @@ class MiscFunctionsSuite extends QueryTest with SharedSparkSession { checkAnswer(df.selectExpr("version()"), df.select(version())) } - test("SPARK-21957: get current_user in normal spark apps") { + test("SPARK-21957, SPARK-44860: get current_user, session_user in normal spark apps") { Review Comment: Can we also update `SPARK-21957: get current_user through thrift server`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on a diff in pull request #42549: [SPARK-44860][SQL] Add SESSION_USER function
yaooqinn commented on code in PR #42549: URL: https://github.com/apache/spark/pull/42549#discussion_r1302442192 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala: ## @@ -95,12 +95,13 @@ trait ColumnResolutionHelper extends Logging { } } -// support CURRENT_DATE, CURRENT_TIMESTAMP, and grouping__id + // support CURRENT_DATE, CURRENT_TIMESTAMP, CURRENT_USER, SESSION_USER and grouping__id Review Comment: ```suggestion // support CURRENT_DATE, CURRENT_TIMESTAMP, CURRENT_USER, USER, SESSION_USER and grouping__id ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method
beliefer commented on code in PR #42612: URL: https://github.com/apache/spark/pull/42612#discussion_r1302441432 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala: ## @@ -279,7 +282,8 @@ case class StaticInvoke( inputTypes: Seq[AbstractDataType] = Nil, propagateNull: Boolean = true, returnNullable: Boolean = true, -isDeterministic: Boolean = true) extends InvokeLike { +isDeterministic: Boolean = true, +scalarFunction: Option[ScalarFunction[_]] = None) extends InvokeLike { Review Comment: It seems you just try to get the name of scalar function, shall we only add the name parameters ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zwangsheng commented on pull request #42600: [SPARK-44906][K8S] Make `Kubernetes[Driver|Executor]Conf.annotations` substitute annotations instead of feature steps
zwangsheng commented on PR #42600: URL: https://github.com/apache/spark/pull/42600#issuecomment-1689213258 Thanks for your review @dongjoon-hyun! Addressed, let me know if anything missed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #42624: [SPARK-44925][K8S] K8s default service token file should not be materialized into token
dongjoon-hyun opened a new pull request, #42624: URL: https://github.com/apache/spark/pull/42624 … ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ragnarok56 opened a new pull request, #42623: [SPARK-44924][SS] Add config for FileStreamSource cached files
ragnarok56 opened a new pull request, #42623: URL: https://github.com/apache/spark/pull/42623 ### What changes were proposed in this pull request? This change adds configuration options for the streaming input File Source for `maxCachedFiles` and `discardCachedFilesRatio`. These values were originally introduced with https://github.com/apache/spark/pull/27620 but were hardcoded to 10,000 and 0.2, respectively. ### Why are the changes needed? Under certain workloads with large `maxFilesPerTrigger` settings, the performance gain from caching the input files capped at 10,000 can cause a cluster to be underutilized and jobs to take longer to finish if each batch takes a while to finish. For example, a job with `maxFilesPerTrigger` set to 100,000 would do all 100k in batch 1, then only 10k in batch 2, but both batches could take just as long since some of the files cause skewed processing times. This results in a cluster spending nearly the same amount of time while processing only 1/10 of the files it could have. ### Does this PR introduce _any_ user-facing change? Updated documentation for structured streaming sources to describe new configurations options ### How was this patch tested? New and existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on pull request #42622: [SPARK-44923][PYTHON][DOCS] Some directories should be cleared when regenerating files
panbingkun commented on PR #42622: URL: https://github.com/apache/spark/pull/42622#issuecomment-1689197410 > in what case will this be a problem? > I don't see similar doc build failures in CI of master and branch-3.5 1. The docs file has been generated locally. 2. At this point, the pyspark code has made changes, such as deleting the function 'chr'`. 3. Execute the command to generate docs files locally, and an error will occur at this time. 4. The reason why it doesn't happen on GA is because it's always a new generation on GA. In `conf.py`, both directories `source\reference\api` and `reference\pyspark.pandas\api` have been cleaned, and we should maintain consistency: `reference\pyspark.sql\api` and `reference\pyspark.ss\api`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on a diff in pull request #41785: [SPARK-44241][Core] Mistakenly set io.connectionTimeout/connectionCreationTimeout to zero or negative will cause incessant executo
yaooqinn commented on code in PR #41785: URL: https://github.com/apache/spark/pull/41785#discussion_r1302410752 ## common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java: ## @@ -276,10 +277,19 @@ public void initChannel(SocketChannel ch) { // Connect to the remote server long preConnect = System.nanoTime(); ChannelFuture cf = bootstrap.connect(address); -if (!cf.await(conf.connectionCreationTimeoutMs())) { + +if (connCreateTimeout <= 0) { + cf.awaitUninterruptibly(); Review Comment: Thank you @JoshRosen. The fix looks good to me. While considering the usage of `await` and `awaitUninterruptibly`, I consulted the Netty documentation (https://netty.io/4.0/api/io/netty/channel/ChannelFuture.html) which recommended the use of `awaitUninterruptibly` in all its GOOD examples. So... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #42608: [SPARK-42017][PYTHON][CONNECT][TESTS] Enable `ColumnParityTests.test_access_column`
zhengruifeng commented on code in PR #42608: URL: https://github.com/apache/spark/pull/42608#discussion_r1302385936 ## python/pyspark/sql/connect/dataframe.py: ## @@ -1623,6 +1624,10 @@ def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Colum alias = self._get_alias() if self._plan is None: raise SparkConnectException("Cannot analyze on empty plan.") + +if "." not in item and "*" not in item and "`" not in item and item not in self.columns: Review Comment: I now hesitate whether this is worthwhile @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #42622: [SPARK-44923][PYTHON][DOCS] Some directories should be cleared when regenerating files
zhengruifeng commented on PR #42622: URL: https://github.com/apache/spark/pull/42622#issuecomment-1689179825 @panbingkun in what case will this be a problem? I don't see similar doc build failures in CI of master and branch-3.5 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on pull request #42601: [SPARK-44905][SQL] Stateful lastRegex causes NullPointerException on eval for regexp_replace
yaooqinn commented on PR #42601: URL: https://github.com/apache/spark/pull/42601#issuecomment-1689173818 thanks @cloud-fan, merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on pull request #42622: [SPARK-44923][PYTHON][DOCS] Some directories should be cleared when regenerating files
panbingkun commented on PR #42622: URL: https://github.com/apache/spark/pull/42622#issuecomment-1689173498 cc @zhengruifeng @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on pull request #42622: [SPARK-44923][PYTHON][DOCS] Some directories should be cleared when regenerating files
panbingkun commented on PR #42622: URL: https://github.com/apache/spark/pull/42622#issuecomment-1689173078 As shown in the following figure, during the document generation process of `sphinx-build`, some directories and files will be automatically generated in the directory: https://github.com/apache/spark/assets/15246973/c679c9d3-a010-4884-8f96-f56bc5fcde4c;> https://github.com/apache/spark/assets/15246973/d04b4330-8784-46a3-9a58-cb27b75dcc97;> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn closed pull request #42601: [SPARK-44905][SQL] Stateful lastRegex causes NullPointerException on eval for regexp_replace
yaooqinn closed pull request #42601: [SPARK-44905][SQL] Stateful lastRegex causes NullPointerException on eval for regexp_replace URL: https://github.com/apache/spark/pull/42601 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun opened a new pull request, #42622: [SPARK-44923][PYTHON][DOCS] Some directories should be cleared when regenerating files
panbingkun opened a new pull request, #42622: URL: https://github.com/apache/spark/pull/42622 ### What changes were proposed in this pull request? The pr aims to fix some bug in regenerating pyspark docs in certain scenarios. ### Why are the changes needed? - The following error occurred while I was regenerating the pyspark document. https://github.com/apache/spark/assets/15246973/548abd63-4349-4267-b1fe-a293bd1e7f3e;> - We can simply reproduce this problem as follows: 1.git reset --hard 3f380b9ecc8b27f6965b554061572e0990f0513 https://github.com/apache/spark/assets/15246973/5ab9c8fc-5835-4ced-8d92-9d5e020b262a;> 2.make clean html, at this point, it is successful. https://github.com/apache/spark/assets/15246973/5c3ce07f-cbe8-4177-ae22-b16c3fc62e01;> 3.git pull 4.make clean html, at this point, it is failed. https://github.com/apache/spark/assets/15246973/548abd63-4349-4267-b1fe-a293bd1e7f3e;> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 1.Pass GA. 2.Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on pull request #42614: [SPARK-44922][TESTS] Disable o.a.p.h.InternalParquetRecordWriter logs for tests
yaooqinn commented on PR #42614: URL: https://github.com/apache/spark/pull/42614#issuecomment-1689169814 Hi @dongjoon-hyun @HyukjinKwon, thanks for the suggestion, the associate JIRA ticket has been attached. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #42621: [SPARK-43567][FOLLOWUP] Missing backtick from migration guide
HyukjinKwon closed pull request #42621: [SPARK-43567][FOLLOWUP] Missing backtick from migration guide URL: https://github.com/apache/spark/pull/42621 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #42621: [SPARK-43567][FOLLOWUP] Missing backtick from migration guide
HyukjinKwon commented on PR #42621: URL: https://github.com/apache/spark/pull/42621#issuecomment-1689170414 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #42428: [SPARK-44742][PYTHON][DOCS] Add Spark version drop down to the PySpark doc site
HyukjinKwon closed pull request #42428: [SPARK-44742][PYTHON][DOCS] Add Spark version drop down to the PySpark doc site URL: https://github.com/apache/spark/pull/42428 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #42428: [SPARK-44742][PYTHON][DOCS] Add Spark version drop down to the PySpark doc site
HyukjinKwon commented on PR #42428: URL: https://github.com/apache/spark/pull/42428#issuecomment-1689168915 Alright, let's see how it gose. Merged to master and branch-3.5. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic opened a new pull request, #42621: [SPARK-43567][FOLLOWUP] Missing backtick from migration guide
itholic opened a new pull request, #42621: URL: https://github.com/apache/spark/pull/42621 ### What changes were proposed in this pull request? This is followup for https://github.com/apache/spark/pull/42270. ### Why are the changes needed? To render the migration guide properly. ### Does this PR introduce _any_ user-facing change? No, it's documentation fix. ### How was this patch tested? The existing CI should pass. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #42594: [SPARK-44839][SS][CONNECT] Better Error Logging when user tries to serialize spark session
itholic commented on code in PR #42594: URL: https://github.com/apache/spark/pull/42594#discussion_r1302374470 ## python/pyspark/sql/connect/streaming/query.py: ## @@ -237,7 +239,13 @@ def addListener(self, listener: StreamingQueryListener) -> None: listener._init_listener_id() cmd = pb2.StreamingQueryManagerCommand() expr = proto.PythonUDF() -expr.command = CloudPickleSerializer().dumps(listener) +try: +expr.command = CloudPickleSerializer().dumps(listener) +except pickle.PicklingError: +raise PySparkRuntimeError( Review Comment: > define a new PySparkPicklingError and replace this PySparkRuntimeError with that right? Correct :-) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a diff in pull request #42617: [SPARK-44918][SQL][PYTHON] Support named arguments in scalar Python/Pandas UDFs
ueshin commented on code in PR #42617: URL: https://github.com/apache/spark/pull/42617#discussion_r1302373707 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala: ## @@ -50,6 +50,19 @@ case class UserDefinedPythonFunction( udfDeterministic: Boolean) { def builder(e: Seq[Expression]): Expression = { +if (pythonEvalType == PythonEvalType.SQL_BATCHED_UDF +|| pythonEvalType ==PythonEvalType.SQL_ARROW_BATCHED_UDF +|| pythonEvalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF) { + /* + * Check if the named arguments: + * - don't have duplicated names + * - don't contain positional arguments Review Comment: The third item is not right for Python UDF/UDTF. Currently it relies on Python. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on pull request #40932: [SPARK-43266][SQL] Move MergeScalarSubqueries to spark-sql
beliefer commented on PR #40932: URL: https://github.com/apache/spark/pull/40932#issuecomment-1689157078 @peter-toth Got it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #42608: [SPARK-42017][PYTHON][CONNECT][TESTS] Enable `ColumnParityTests.test_access_column`
zhengruifeng commented on code in PR #42608: URL: https://github.com/apache/spark/pull/42608#discussion_r1302368571 ## python/pyspark/sql/connect/dataframe.py: ## @@ -1623,6 +1624,10 @@ def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Colum alias = self._get_alias() if self._plan is None: raise SparkConnectException("Cannot analyze on empty plan.") + +if "." not in item and "*" not in item and "`" not in item and item not in self.columns: Review Comment: similar to https://github.com/apache/spark/commit/91e97f92fe76f9718cd16af0c761d5530bdb37ee, but need to cases like `*`, `a.*`, ``aa.bb`` ## python/pyspark/sql/connect/dataframe.py: ## @@ -1623,6 +1624,10 @@ def __getitem__(self, item: Union[int, str, Column, List, Tuple]) -> Union[Colum alias = self._get_alias() if self._plan is None: raise SparkConnectException("Cannot analyze on empty plan.") + +if "." not in item and "*" not in item and "`" not in item and item not in self.columns: Review Comment: similar to https://github.com/apache/spark/commit/91e97f92fe76f9718cd16af0c761d5530bdb37ee, but need to skip cases like `*`, `a.*`, ``aa.bb`` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on a diff in pull request #42618: [SPARK-44919] Avro connector: convert a union of a single primitive type to a StructType
sadikovi commented on code in PR #42618: URL: https://github.com/apache/spark/pull/42618#discussion_r1302363525 ## connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala: ## @@ -142,18 +143,30 @@ object SchemaConverters { if (avroSchema.getTypes.asScala.exists(_.getType == NULL)) { // In case of a union with null, eliminate it and make a recursive call val remainingUnionTypes = AvroUtils.nonNullUnionBranches(avroSchema) - if (remainingUnionTypes.size == 1) { -toSqlTypeHelper(remainingUnionTypes.head, existingRecordNames, avroOptions) - .copy(nullable = true) - } else { -toSqlTypeHelper( - Schema.createUnion(remainingUnionTypes.asJava), - existingRecordNames, - avroOptions).copy(nullable = true) - } + toSqlTypeHelper( +Schema.createUnion(remainingUnionTypes.asJava), +existingRecordNames, +avroOptions).copy(nullable = true) } else avroSchema.getTypes.asScala.map(_.getType).toSeq match { case Seq(t1) => -toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, avroOptions) +// If spark.sql.avro.alwaysConvertUnionToStructType is set to false (default), +// we convert Avro union with a single primitive type into a primitive Spark type +// instead of a StructType. +if (!SQLConf.get.avroAlwaysConvertUnionToStruct) { + toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, avroOptions) Review Comment: Maybe it is good to compare the two approaches and see which one makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on a diff in pull request #42618: [SPARK-44919] Avro connector: convert a union of a single primitive type to a StructType
sadikovi commented on code in PR #42618: URL: https://github.com/apache/spark/pull/42618#discussion_r1302362345 ## connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala: ## @@ -142,18 +143,30 @@ object SchemaConverters { if (avroSchema.getTypes.asScala.exists(_.getType == NULL)) { // In case of a union with null, eliminate it and make a recursive call val remainingUnionTypes = AvroUtils.nonNullUnionBranches(avroSchema) - if (remainingUnionTypes.size == 1) { -toSqlTypeHelper(remainingUnionTypes.head, existingRecordNames, avroOptions) - .copy(nullable = true) - } else { -toSqlTypeHelper( - Schema.createUnion(remainingUnionTypes.asJava), - existingRecordNames, - avroOptions).copy(nullable = true) - } + toSqlTypeHelper( +Schema.createUnion(remainingUnionTypes.asJava), +existingRecordNames, +avroOptions).copy(nullable = true) } else avroSchema.getTypes.asScala.map(_.getType).toSeq match { case Seq(t1) => -toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, avroOptions) +// If spark.sql.avro.alwaysConvertUnionToStructType is set to false (default), +// we convert Avro union with a single primitive type into a primitive Spark type +// instead of a StructType. +if (!SQLConf.get.avroAlwaysConvertUnionToStruct) { + toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, avroOptions) Review Comment: I was going to suggest doing this for Avro schema converter in the Avro data source where this is required because the top schema is always expected to be a struct. For example, we change this code in AvroUtils: ```scala SchemaConverters.toSqlType(avroSchema, options).dataType match { case t: StructType => Some(t) case _ => throw new RuntimeException( s"""Avro schema cannot be converted to a Spark SQL StructType: | |${avroSchema.toString(true)} |""".stripMargin) } ``` to something like this: ```scala SchemaConverters.toSqlType(avroSchema, options).dataType match { case t: StructType => Some(t) case t: AtomicType => StructType(StructField("value", t, nullable = true)) case _ => throw new RuntimeException( s"""Avro schema cannot be converted to a Spark SQL StructType: | |${avroSchema.toString(true)} |""".stripMargin) } ``` Can we also check that this change will not affect `from_avro` SQL function? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on a diff in pull request #42618: [SPARK-44919] Avro connector: convert a union of a single primitive type to a StructType
sadikovi commented on code in PR #42618: URL: https://github.com/apache/spark/pull/42618#discussion_r1302356178 ## connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala: ## @@ -142,18 +143,30 @@ object SchemaConverters { if (avroSchema.getTypes.asScala.exists(_.getType == NULL)) { // In case of a union with null, eliminate it and make a recursive call val remainingUnionTypes = AvroUtils.nonNullUnionBranches(avroSchema) - if (remainingUnionTypes.size == 1) { -toSqlTypeHelper(remainingUnionTypes.head, existingRecordNames, avroOptions) - .copy(nullable = true) - } else { -toSqlTypeHelper( - Schema.createUnion(remainingUnionTypes.asJava), - existingRecordNames, - avroOptions).copy(nullable = true) - } + toSqlTypeHelper( +Schema.createUnion(remainingUnionTypes.asJava), +existingRecordNames, +avroOptions).copy(nullable = true) } else avroSchema.getTypes.asScala.map(_.getType).toSeq match { case Seq(t1) => -toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, avroOptions) +// If spark.sql.avro.alwaysConvertUnionToStructType is set to false (default), +// we convert Avro union with a single primitive type into a primitive Spark type +// instead of a StructType. +if (!SQLConf.get.avroAlwaysConvertUnionToStruct) { + toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, avroOptions) +} else { + val singleton = avroSchema.getTypes.get(0) + val schemaType = toSqlTypeHelper(singleton, existingRecordNames, avroOptions) + val fieldName = if (avroOptions.useStableIdForUnionType) { +s"member_${singleton.getName.toLowerCase(Locale.ROOT)}" + } else { +s"member0" Review Comment: Could you explore if there is a way to not duplicate the logic of stable identifiers? ## connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala: ## @@ -142,18 +143,30 @@ object SchemaConverters { if (avroSchema.getTypes.asScala.exists(_.getType == NULL)) { // In case of a union with null, eliminate it and make a recursive call val remainingUnionTypes = AvroUtils.nonNullUnionBranches(avroSchema) - if (remainingUnionTypes.size == 1) { -toSqlTypeHelper(remainingUnionTypes.head, existingRecordNames, avroOptions) - .copy(nullable = true) - } else { -toSqlTypeHelper( - Schema.createUnion(remainingUnionTypes.asJava), - existingRecordNames, - avroOptions).copy(nullable = true) - } + toSqlTypeHelper( +Schema.createUnion(remainingUnionTypes.asJava), +existingRecordNames, +avroOptions).copy(nullable = true) } else avroSchema.getTypes.asScala.map(_.getType).toSeq match { case Seq(t1) => -toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, avroOptions) +// If spark.sql.avro.alwaysConvertUnionToStructType is set to false (default), +// we convert Avro union with a single primitive type into a primitive Spark type +// instead of a StructType. +if (!SQLConf.get.avroAlwaysConvertUnionToStruct) { + toSqlTypeHelper(avroSchema.getTypes.get(0), existingRecordNames, avroOptions) Review Comment: I was going to suggest doing this for Avro schema converter in the Avro data source where this is required because the top schema is always expected to be a struct. For example, we change this code in AvroUtils: ```scala // Converts Avro schema to sql type and ensures that the top level data type is either an Avro // record or a complex union that both result in a conversion to StructType def convertAvroToSqlSchema(avroSchema: Schema, avroOptions: AvroOptions): StructType = { SchemaConverters.toSqlTypeHelper(avroSchema, Set.empty, avroOptions).dataType match { case t: StructType => t case _ => throw new RuntimeException( s"""Avro schema cannot be converted to a Spark SQL StructType: | |${avroSchema.toString(true)} |""".stripMargin) } } ``` to something like this: ```scala // Converts Avro schema to sql type and ensures that the top level data type is either an Avro // record or a complex union that both result in a conversion to StructType def convertAvroToSqlSchema(avroSchema: Schema, avroOptions: AvroOptions): StructType = { SchemaConverters.toSqlTypeHelper(avroSchema, Set.empty, avroOptions).dataType match { case t: StructType => t case t: AtomicType =>
[GitHub] [spark] Hisoka-X commented on a diff in pull request #42194: [SPARK-41471][SQL] Reduce Spark shuffle when only one side of a join is KeyGroupedPartitioning
Hisoka-X commented on code in PR #42194: URL: https://github.com/apache/spark/pull/42194#discussion_r1302357936 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala: ## @@ -182,7 +182,16 @@ case class BatchScanExec( // Now fill missing partition keys with empty partitions val partitionMapping = nestGroupedPartitions.toMap - finalPartitions = spjParams.commonPartitionValues.get.flatMap { + + // SPARK-41471: We keep to order of partition keys in `commonPartitionValues` to Review Comment: Thanks for advise! Adressed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on a diff in pull request #42599: [DO-NOT-MERGE] Remove Guava from shared classes from IsolatedClientLoader
pan3793 commented on code in PR #42599: URL: https://github.com/apache/spark/pull/42599#discussion_r1302355574 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala: ## @@ -130,8 +130,7 @@ private[hive] object IsolatedClientLoader extends Logging { } val hiveArtifacts = version.extraDeps ++ Seq("hive-metastore", "hive-exec", "hive-common", "hive-serde") -.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ - Seq("com.google.guava:guava:14.0.1") ++ hadoopJarNames +.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ hadoopJarNames Review Comment: > if we remove Guava from shared classes but also changed the now-non-shared metastore Guava version then it's the same net effect and breaks older Hive versions. Hmm, I don't get the idea. After we remove Guava from the shared class, we don't need to keep Hive and Spark using the same Guava. And we did not add Guava to `HiveVersion#exclusions`, each Hive version should automatically pull all transitive dependencies including Guava if necessary. Anyway, let's simply try (a) first. ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala: ## @@ -130,8 +130,7 @@ private[hive] object IsolatedClientLoader extends Logging { } val hiveArtifacts = version.extraDeps ++ Seq("hive-metastore", "hive-exec", "hive-common", "hive-serde") -.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ - Seq("com.google.guava:guava:14.0.1") ++ hadoopJarNames +.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ hadoopJarNames Review Comment: > if we remove Guava from shared classes but also changed the now-non-shared metastore Guava version then it's the same net effect and breaks older Hive versions. Hmm, I don't get the idea. After removing Guava from the shared class, we don't need to keep Hive and Spark using the same Guava. And we did not add Guava to `HiveVersion#exclusions`, each Hive version should automatically pull all transitive dependencies including Guava if necessary. Anyway, let's simply try (a) first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on a diff in pull request #42599: [DO-NOT-MERGE] Remove Guava from shared classes from IsolatedClientLoader
pan3793 commented on code in PR #42599: URL: https://github.com/apache/spark/pull/42599#discussion_r1302355574 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala: ## @@ -130,8 +130,7 @@ private[hive] object IsolatedClientLoader extends Logging { } val hiveArtifacts = version.extraDeps ++ Seq("hive-metastore", "hive-exec", "hive-common", "hive-serde") -.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ - Seq("com.google.guava:guava:14.0.1") ++ hadoopJarNames +.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ hadoopJarNames Review Comment: > if we remove Guava from shared classes but also changed the now-non-shared metastore Guava version then it's the same net effect and breaks older Hive versions. Hmm, I don't get the idea. After we remove Guava from the shared class, then we don't need to keep Hive and Spark using the same Guava. And we did not add Guava to `HiveVersion#exclusions`, each Hive version should automatically pull all transitive dependencies including Guava if necessary. Anyway, let's simply try (a) first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on pull request #42513: [SPARK-44827][PYTHON][TESTS] Fix test when ansi mode enabled
panbingkun commented on PR #42513: URL: https://github.com/apache/spark/pull/42513#issuecomment-1689111917 All done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #42620: [SPARK-44921][SQL] Remove SqlBaseLexer.tokens from codebase
amaliujia commented on PR #42620: URL: https://github.com/apache/spark/pull/42620#issuecomment-1689108560 @cloud-fan @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #42620: [SPARK-44921][SQL] Remove SqlBaseLexer.tokens from codebase
amaliujia opened a new pull request, #42620: URL: https://github.com/apache/spark/pull/42620 ### What changes were proposed in this pull request? https://github.com/apache/spark/commit/8ff6b7a04cbaef9c552789ad5550ceab760cb078#diff-f4df4ce19570230091c3b2432e3c84cd2db7059c7b2a03213d272094bd940454 refactors antlr4 files to `sql/api` but checked in `SqlBaseLexer.tokens`. This file is generated so we do not need to check it in. ### Why are the changes needed? Remove file that do not need to be checked in. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing Test. ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JoshRosen commented on a diff in pull request #41785: [SPARK-44241][Core] Mistakenly set io.connectionTimeout/connectionCreationTimeout to zero or negative will cause incessant execut
JoshRosen commented on code in PR #41785: URL: https://github.com/apache/spark/pull/41785#discussion_r1302331832 ## common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java: ## @@ -276,10 +277,19 @@ public void initChannel(SocketChannel ch) { // Connect to the remote server long preConnect = System.nanoTime(); ChannelFuture cf = bootstrap.connect(address); -if (!cf.await(conf.connectionCreationTimeoutMs())) { + +if (connCreateTimeout <= 0) { + cf.awaitUninterruptibly(); Review Comment: Filed https://issues.apache.org/jira/browse/SPARK-44920 and opened https://github.com/apache/spark/pull/42619 to fix this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JoshRosen opened a new pull request, #42619: [SPARK-44920][CORE] Use await() instead of awaitUninterruptibly() in TransportClientFactory.createClient()
JoshRosen opened a new pull request, #42619: URL: https://github.com/apache/spark/pull/42619 ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/41785 / SPARK-44241 introduced a new `awaitUninterruptibly()` call in one branch of `TrasportClientFactory.createClient()` (executed when the connection create timeout is non-positive). This PR replaces that call with an interruptible `await()` call. Note that the other pre-existing branches in this method were already using `await()`. ### Why are the changes needed? Uninterruptible waiting can cause problems when cancelling tasks. For details, see https://github.com/apache/spark/pull/16866 / SPARK-19529, an older PR fixing a similar issue in this same `TransportClientFactory.createClient()` method. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on pull request #42578: [SPARK-44841][FOLLOWUP] Add migration guide for the behavior change
itholic commented on PR #42578: URL: https://github.com/apache/spark/pull/42578#issuecomment-1689101755 Yeah, at least we have to fix the typo "propotion" to "proportion". Otherwise I don't think it's absolutely necessary, but please feel free to made a change. I have no strong opinion unless it doesn't include the wrong information. Thanks for catching the typo, @bjornjorgensen ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on a diff in pull request #42617: [SPARK-44918][SQL][PYTHON] Support named arguments in scalar Python/Pandas UDFs
dtenedor commented on code in PR #42617: URL: https://github.com/apache/spark/pull/42617#discussion_r1302322467 ## python/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py: ## @@ -1467,6 +1467,96 @@ def udf(x): finally: shutil.rmtree(path) +def test_named_arguments(self): +@pandas_udf("int") +def test_udf(a, b): +return a + 10 * b + +self.spark.udf.register("test_udf", test_udf) + +for i, df in enumerate( +[ +self.spark.range(2).select(test_udf(a=col("id"), b=col("id") * 10)), +self.spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))), +self.spark.sql("SELECT test_udf(a => id, b => id * 10) FROM range(2)"), Review Comment: can we also have a test case with a positional argument first and then a named argument after? We can have a positive version of this test case where the positional argument maps to the first argument of the UDF and the named argument refers to the second, and another negative version where the named argument uses the same name as the first (positional) argument? Same for `test_udf.py`? ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala: ## @@ -50,6 +50,19 @@ case class UserDefinedPythonFunction( udfDeterministic: Boolean) { def builder(e: Seq[Expression]): Expression = { +if (pythonEvalType == PythonEvalType.SQL_BATCHED_UDF +|| pythonEvalType ==PythonEvalType.SQL_ARROW_BATCHED_UDF +|| pythonEvalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF) { + /* + * Check if the named arguments: + * - don't have duplicated names + * - don't contain positional arguments Review Comment: ```suggestion * - don't contain positional arguments after named arguments * - all map to valid argument names from the function declaration ``` ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDFRunner.scala: ## @@ -146,4 +157,30 @@ object PythonUDFRunner { } } } + + def writeUDFs( +dataOut: DataOutputStream, Review Comment: please indent each of these function arguments by 2 more spaces (for a total of +4 spaces each) per the style guide? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42513: [SPARK-44827][PYTHON][TESTS] Fix test when ansi mode enabled
HyukjinKwon commented on code in PR #42513: URL: https://github.com/apache/spark/pull/42513#discussion_r1302314334 ## python/pyspark/testing/utils.py: ## @@ -539,6 +540,9 @@ def compare_vals(val1, val2): elif isinstance(val1, float) and isinstance(val2, float): if abs(val1 - val2) > (atol + rtol * abs(val2)): return False +elif isinstance(val1, Decimal) and isinstance(val2, Decimal): Review Comment: @asl3 FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42513: [SPARK-44827][PYTHON][TESTS] Fix test when ansi mode enabled
HyukjinKwon commented on code in PR #42513: URL: https://github.com/apache/spark/pull/42513#discussion_r1302314216 ## python/pyspark/testing/utils.py: ## @@ -411,8 +412,8 @@ def assertDataFrameEqual( Note that schema equality is checked only when `expected` is a DataFrame (not a list of Rows). -For DataFrames with float values, assertDataFrame asserts approximate equality. -Two float values a and b are approximately equal if the following equation is True: +For DataFrames with float/Decimal values, assertDataFrame asserts approximate equality. Review Comment: ```suggestion For DataFrames with float/decimal values, assertDataFrame asserts approximate equality. ``` ## python/pyspark/testing/utils.py: ## @@ -411,8 +412,8 @@ def assertDataFrameEqual( Note that schema equality is checked only when `expected` is a DataFrame (not a list of Rows). -For DataFrames with float values, assertDataFrame asserts approximate equality. -Two float values a and b are approximately equal if the following equation is True: +For DataFrames with float/Decimal values, assertDataFrame asserts approximate equality. +Two float/Decimal values a and b are approximately equal if the following equation is True: Review Comment: ```suggestion Two float/decimal values a and b are approximately equal if the following equation is True: ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42513: [SPARK-44827][PYTHON][TESTS] Fix test when ansi mode enabled
HyukjinKwon commented on code in PR #42513: URL: https://github.com/apache/spark/pull/42513#discussion_r1302314087 ## python/pyspark/sql/functions.py: ## @@ -7802,8 +7802,20 @@ def to_unix_timestamp( >>> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") >>> df = spark.createDataFrame([("2016-04-08",)], ["e"]) ->>> df.select(to_unix_timestamp(df.e).alias('r')).collect() Review Comment: I think we can remove this whole example for now. This is a user-facing documentation that shows an example so we won't necessarily show the failure example -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42513: [SPARK-44827][PYTHON][TESTS] Fix test when ansi mode enabled
HyukjinKwon commented on code in PR #42513: URL: https://github.com/apache/spark/pull/42513#discussion_r1302313324 ## python/pyspark/sql/dataframe.py: ## @@ -3912,16 +3912,29 @@ def union(self, other: "DataFrame") -> "DataFrame": >>> df2 = spark.createDataFrame([(3, "Charlie"), (4, "Dave")], ["id", "name"]) >>> df1 = df1.withColumn("age", lit(30)) >>> df2 = df2.withColumn("age", lit(40)) ->>> df3 = df1.union(df2) ->>> df3.show() -+-+---+---+ -| name| id|age| -+-+---+---+ -|Alice| 1| 30| -| Bob| 2| 30| -|3|Charlie| 40| -|4| Dave| 40| -+-+---+---+ +>>> if spark.conf.get("spark.sql.ansi.enabled") == "false": Review Comment: Can we fix the input types, for example, floats and ints so it can pass whether `spark.sql.ansi.enabled` is on or off? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #42614: [MINOR][INFRA] Disable o.a.p.h.InternalParquetRecordWriter logs for tests
HyukjinKwon commented on PR #42614: URL: https://github.com/apache/spark/pull/42614#issuecomment-1689075541 yeah, let's file a JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] tianhanhu opened a new pull request, #42618: [SPARK-44919] Avro connector: convert a union of a single primitive type to a StructType
tianhanhu opened a new pull request, #42618: URL: https://github.com/apache/spark/pull/42618 ### What changes were proposed in this pull request? This PR adds a new behavior guarded by a new config that changes how a Avro union of a single primitive type is converted to Spark types.. ### Why are the changes needed? Spark Avro data source schema converter currently converts union with a single primitive type to a Spark primitive type instead of a StructType. While for more complex union types that consists of multiple primitive types, the schema converter translate them into StructTypes. For example, ``` import scala.collection.JavaConverters._ import org.apache.avro._ import org.apache.spark.sql.avro._ // ["string", "null"] SchemaConverters.toSqlType( Schema.createUnion(Seq(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)).asJava) ).dataType // ["string", "int", "null"] SchemaConverters.toSqlType( Schema.createUnion(Seq(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.INT), Schema.create(Schema.Type.NULL)).asJava) ).dataType ``` The first one would return StringType, the second would return StructType(StringType, IntegerType). It is undesirable in some cases as we may want to consistently get StructType out of Avro union. ### Does this PR introduce _any_ user-facing change? Yes, a new config `spark.sql.avro.alwaysConvertUnionToStructType` that can but used to define how an Avro union of a single type is converted into Spark types. The default behavior stays the same. ### How was this patch tested? New unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #42603: [SPARK-44907][PYTHON][CONNECT] `DataFrame.join` should throw IllegalArgumentException for invalid join types
zhengruifeng commented on PR #42603: URL: https://github.com/apache/spark/pull/42603#issuecomment-1689073965 thanks! merged to master and branch-3.5 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng closed pull request #42603: [SPARK-44907][PYTHON][CONNECT] `DataFrame.join` should throw IllegalArgumentException for invalid join types
zhengruifeng closed pull request #42603: [SPARK-44907][PYTHON][CONNECT] `DataFrame.join` should throw IllegalArgumentException for invalid join types URL: https://github.com/apache/spark/pull/42603 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #42578: [SPARK-44841][FOLLOWUP] Add migration guide for the behavior change
HyukjinKwon commented on PR #42578: URL: https://github.com/apache/spark/pull/42578#issuecomment-1689072116 WDYT @itholic ^ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42272: [SPARK-44508][PYTHON][DOCS] Add user guide for Python user-defined table functions
allisonwang-db commented on code in PR #42272: URL: https://github.com/apache/spark/pull/42272#discussion_r1302303933 ## python/docs/source/user_guide/sql/python_udtf.rst: ## @@ -0,0 +1,222 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +=== +Python User-defined Table Functions (UDTFs) +=== + +Spark 3.5 introduces Python user-defined table functions (UDTFs), a new type of user-defined function. +Unlike scalar functions that return a single result value, a UDTF is invoked in the FROM clause and returns +an entire relation as output. Each UDTF call can accept zero or more arguments. +These arguments can be scalar constant expressions or separate input relations. + +Implementing a Python UDTF +-- + +.. currentmodule:: pyspark.sql.functions + +To implement a Python UDTF, you can define a class implementing the methods: + +.. code-block:: python + +class PythonUDTF: + +def __init__(self) -> None: +""" +Initialize the user-defined table function (UDTF). + +This method serves as the default constructor and is called once when the +UDTF is instantiated on the executor side. + +Any class fields assigned in this method will be available for subsequent +calls to the `eval` and `terminate` methods. + +Notes +- +- This method does not accept any extra arguments. +- You cannot create or reference the Spark session within the UDTF. Any + attempt to do so will result in a serialization error. +""" +... + +def eval(self, *args: Any) -> Iterator[Any]: +""" +Evaluate the function using the given input arguments. + +This method is required and must be implemented. + +The arguments provided to the UDTF call are mapped to the values in the +`*args` list sequentially. Each provided scalar expression maps to exactly +one value in this `*args` list. Each provided TABLE argument of N columns Review Comment: Updated! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42272: [SPARK-44508][PYTHON][DOCS] Add user guide for Python user-defined table functions
allisonwang-db commented on code in PR #42272: URL: https://github.com/apache/spark/pull/42272#discussion_r1302303763 ## examples/src/main/python/sql/udtf.py: ## @@ -0,0 +1,230 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +A simple example demonstrating Python UDTFs in Spark +Run with: + ./bin/spark-submit examples/src/main/python/sql/udtf.py +""" + +# NOTE that this file is imported in the User Guides in PySpark documentation. +# The codes are referred via line numbers. See also `literalinclude` directive in Sphinx. +from pyspark.sql import SparkSession +from pyspark.sql.pandas.utils import require_minimum_pandas_version, require_minimum_pyarrow_version + +# Python UDTFs use Arrow by default. +require_minimum_pandas_version() +require_minimum_pyarrow_version() + + +def python_udtf_simple_example(spark: SparkSession) -> None: + +# Define the UDTF class and implement the required `eval` method. +class SquareNumbers: +def eval(self, start: int, end: int): # type: ignore[no-untyped-def] +for num in range(start, end + 1): +yield (num, num * num) + +from pyspark.sql.functions import lit, udtf + +# Create a UDTF using the class definition and the `udtf` function. +square_num = udtf(SquareNumbers, returnType="num: int, squared: int") + +# Invoke the UDTF in PySpark. +square_num(lit(1), lit(3)).show() # type: ignore +# +---+--+ +# |num|squred| Review Comment: Good catch! Updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JoshRosen commented on a diff in pull request #42599: [DO-NOT-MERGE] Remove Guava from shared classes from IsolatedClientLoader
JoshRosen commented on code in PR #42599: URL: https://github.com/apache/spark/pull/42599#discussion_r1302282764 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala: ## @@ -130,8 +130,7 @@ private[hive] object IsolatedClientLoader extends Logging { } val hiveArtifacts = version.extraDeps ++ Seq("hive-metastore", "hive-exec", "hive-common", "hive-serde") -.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ - Seq("com.google.guava:guava:14.0.1") ++ hadoopJarNames +.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ hadoopJarNames Review Comment: Here, I think the Guava version needs to be a function of `version`, rather than matching Spark's Guava version or omitting Guava: if we remove Guava from shared classes but also changed the now-non-shared metastore Guava version then it's the same net effect and breaks older Hive versions. I believe that newer versions of Hive shade Guava (in which case they're insensitive to whatever value we set here). I think we could either (a) continue to unconditionally use Guava 14.0.1 here, or (b) conditionally use it only for older Hive versions that predated the Guava shading. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] JoshRosen commented on a diff in pull request #42599: [DO-NOT-MERGE] Remove Guava from shared classes from IsolatedClientLoader
JoshRosen commented on code in PR #42599: URL: https://github.com/apache/spark/pull/42599#discussion_r1302282764 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala: ## @@ -130,8 +130,7 @@ private[hive] object IsolatedClientLoader extends Logging { } val hiveArtifacts = version.extraDeps ++ Seq("hive-metastore", "hive-exec", "hive-common", "hive-serde") -.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ - Seq("com.google.guava:guava:14.0.1") ++ hadoopJarNames +.map(a => s"org.apache.hive:$a:${version.fullVersion}") ++ hadoopJarNames Review Comment: Here, I think the Guava version needs to be a function of `version`, rather than matching Spark's Guava version: if we remove Guava from shared classes but also changed the now-non-shared metastore Guava version then it's the same net effect and breaks older Hive versions. I believe that newer versions of Hive shade Guava (in which case they're insensitive to whatever value we set here). I think we could either (a) continue to unconditionally use Guava 14.0.1 here, or (b) conditionally use it only for older Hive versions that predated the Guava shading. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sunchao commented on a diff in pull request #42194: [SPARK-41471][SQL] Reduce Spark shuffle when only one side of a join is KeyGroupedPartitioning
sunchao commented on code in PR #42194: URL: https://github.com/apache/spark/pull/42194#discussion_r1302282211 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala: ## @@ -182,7 +182,16 @@ case class BatchScanExec( // Now fill missing partition keys with empty partitions val partitionMapping = nestGroupedPartitions.toMap - finalPartitions = spjParams.commonPartitionValues.get.flatMap { + + // SPARK-41471: We keep to order of partition keys in `commonPartitionValues` to Review Comment: I think partition values should be sorted by themselves but just `mergePartitions` in `EnsureRequirements` doesn't maintain the ordering. Instead of doing it here, can we update `InternalRowComparableWrapper.mergePartitions` to make sure the result is ordered? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42377: [SPARK-44622][SQL][CONNECT] Implement error enrichment and setting server-side stacktrace
heyihong commented on code in PR #42377: URL: https://github.com/apache/spark/pull/42377#discussion_r1301851158 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2843,6 +2843,21 @@ object SQLConf { // show full stacktrace in tests but hide in production by default. .createWithDefault(Utils.isTesting) + val SPARK_ENRICH_ERROR_ENABLED = +buildConf("spark.sql.spark.enrichError.enabled") + .doc("When true, it enriches errors with full exception messages on the client side.") + .version("3.5.0") Review Comment: `org.apache.spark.sql.connect.config.Connect` cannot be configured by `spark.conf.set(...)` if I understand correctly? Maybe we can name this to `spark.sql.connect.enrichError.enabled` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42377: [SPARK-44622][SQL][CONNECT] Implement error enrichment and setting server-side stacktrace
heyihong commented on code in PR #42377: URL: https://github.com/apache/spark/pull/42377#discussion_r1301851158 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2843,6 +2843,21 @@ object SQLConf { // show full stacktrace in tests but hide in production by default. .createWithDefault(Utils.isTesting) + val SPARK_ENRICH_ERROR_ENABLED = +buildConf("spark.sql.spark.enrichError.enabled") + .doc("When true, it enriches errors with full exception messages on the client side.") + .version("3.5.0") Review Comment: `org.apache.spark.sql.connect.config.Connect` cannot be configured by `spark.conf.set(...)` if I understand correctly? Maybe we can name this to `spark.sql.connect.jvmStacktrace.enabled` ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2843,6 +2843,21 @@ object SQLConf { // show full stacktrace in tests but hide in production by default. .createWithDefault(Utils.isTesting) + val SPARK_ENRICH_ERROR_ENABLED = +buildConf("spark.sql.spark.enrichError.enabled") + .doc("When true, it enriches errors with full exception messages on the client side.") + .version("3.5.0") Review Comment: `org.apache.spark.sql.connect.config.Connect` cannot be configured by `spark.conf.set(...)` if I understand correctly? Maybe we can name this to `spark.sql.spark.enrichError.enabled` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on pull request #42617: [SPARK-44918][SQL][PYTHON] Support named arguments in scalar Python/Pandas UDFs
ueshin commented on PR #42617: URL: https://github.com/apache/spark/pull/42617#issuecomment-1689030291 cc @dtenedor @HyukjinKwon @xinrong-meng -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dtenedor commented on pull request #42595: [SPARK-44901][SQL] Add API in Python UDTF 'analyze' method to return partitioning/ordering expressions
dtenedor commented on PR #42595: URL: https://github.com/apache/spark/pull/42595#issuecomment-1689028968 cc @allisonwang-db @ueshin @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zeruibao commented on a diff in pull request #42503: [SPARK-43380][SQL] Fix Avro data type conversion issues without causing performance regression
zeruibao commented on code in PR #42503: URL: https://github.com/apache/spark/pull/42503#discussion_r1302261323 ## connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala: ## @@ -128,6 +133,36 @@ private[sql] class AvroDeserializer( case (INT, IntegerType) => (updater, ordinal, value) => updater.setInt(ordinal, value.asInstanceOf[Int]) + case (LONG, dt: TimestampType) Review Comment: oh, that's a good idea! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #42503: [SPARK-43380][SQL] Fix Avro data type conversion issues without causing performance regression
gengliangwang commented on code in PR #42503: URL: https://github.com/apache/spark/pull/42503#discussion_r1302257921 ## connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala: ## @@ -128,6 +133,36 @@ private[sql] class AvroDeserializer( case (INT, IntegerType) => (updater, ordinal, value) => updater.setInt(ordinal, value.asInstanceOf[Int]) + case (LONG, dt: TimestampType) Review Comment: TimestampType/TimestampNTZType/DateType are all of DatetimeType. We can reduce duplicated code -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zeruibao commented on a diff in pull request #42503: [SPARK-43380][SQL] Fix Avro data type conversion issues without causing performance regression
zeruibao commented on code in PR #42503: URL: https://github.com/apache/spark/pull/42503#discussion_r1302256752 ## connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala: ## @@ -128,6 +133,36 @@ private[sql] class AvroDeserializer( case (INT, IntegerType) => (updater, ordinal, value) => updater.setInt(ordinal, value.asInstanceOf[Int]) + case (LONG, dt: TimestampType) Review Comment: Is `DatetimeType` the same as `TimestampType`? Why using `DatetimeType` is better? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin opened a new pull request, #42617: [SPARK-44918][SQL][PYTHON] Support named arguments in scalar Python/Pandas UDFs
ueshin opened a new pull request, #42617: URL: https://github.com/apache/spark/pull/42617 ### What changes were proposed in this pull request? Supports named arguments in scalar Python/Pandas UDF. For example: ```py >>> @udf("int") ... def test_udf(a, b): ... return a + 10 * b ... >>> spark.udf.register("test_udf", test_udf) >>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show() +-+ |test_udf(b => (id * 10), a => id)| +-+ |0| | 101| +-+ >>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show() +-+ |test_udf(b => (id * 10), a => id)| +-+ |0| | 101| +-+ ``` or: ```py >>> @pandas_udf("int") ... def test_udf(a, b): ... return a + 10 * b ... >>> spark.udf.register("test_udf", test_udf) >>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show() +-+ |test_udf(b => (id * 10), a => id)| +-+ |0| | 101| +-+ >>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show() +-+ |test_udf(b => (id * 10), a => id)| +-+ |0| | 101| +-+ ``` ### Why are the changes needed? Now that named arguments support was added (https://github.com/apache/spark/pull/41796, https://github.com/apache/spark/pull/42020). Scalar Python/Pandas UDFs can support it. ### Does this PR introduce _any_ user-facing change? Yes, named arguments will be available for scalar Python/Pandas UDFs. ### How was this patch tested? Added related tests. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #42615: [SPARK-44916][DOCS][TESTS] Document Spark Driver Live Log UI
dongjoon-hyun commented on PR #42615: URL: https://github.com/apache/spark/pull/42615#issuecomment-1688930713 Merged to master for Apache Spark 4. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #42615: [SPARK-44916][DOCS][TESTS] Document Spark Driver Live Log UI
dongjoon-hyun closed pull request #42615: [SPARK-44916][DOCS][TESTS] Document Spark Driver Live Log UI URL: https://github.com/apache/spark/pull/42615 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #42615: [SPARK-44916][DOCS][TESTS] Document Spark Driver Live Log UI
dongjoon-hyun commented on PR #42615: URL: https://github.com/apache/spark/pull/42615#issuecomment-1688925615 Thank you, @viirya ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #42615: [SPARK-44916][DOCS][TESTS] Document Spark Driver Live Log UI
dongjoon-hyun commented on PR #42615: URL: https://github.com/apache/spark/pull/42615#issuecomment-1688915353 Could you review this PR, @viirya ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #40390: [SPARK-42768][SQL] Enable cached plan apply AQE by default
dongjoon-hyun commented on PR #40390: URL: https://github.com/apache/spark/pull/40390#issuecomment-1688904895 Thank you, @ulysses-you and @cloud-fan . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org