[GitHub] [spark] Hisoka-X commented on a diff in pull request #42979: [SPARK-45035][SQL] Fix ignoreCorruptFiles with multiline CSV/JSON will report error
Hisoka-X commented on code in PR #42979: URL: https://github.com/apache/spark/pull/42979#discussion_r1331553477

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala:

```diff
@@ -190,12 +191,19 @@ object MultiLineCSVDataSource extends CSVDataSource {
       parsedOptions: CSVOptions): StructType = {
     val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
     csv.flatMap { lines =>
-      val path = new Path(lines.getPath())
-      UnivocityParser.tokenizeStream(
-        CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, path),
-        shouldDropHeader = false,
-        new CsvParser(parsedOptions.asParserSettings),
-        encoding = parsedOptions.charset)
+      try {
+        val path = new Path(lines.getPath())
+        UnivocityParser.tokenizeStream(
+          CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, path),
+          shouldDropHeader = false,
+          new CsvParser(parsedOptions.asParserSettings),
+          encoding = parsedOptions.charset)
+      } catch {
+        case e @ (_: RuntimeException | _: IOException) if parsedOptions.ignoreCorruptFiles =>
+          logWarning(
+            s"Skipped the rest of the content in the corrupted file: ${lines.getPath()}", e)
```

Review Comment:

How about changing it to this?

```scala
case e: TextParsingException
    if parsedOptions.ignoreCorruptFiles && e.getCause.getCause.isInstanceOf[EOFException] =>
```

@HyukjinKwon
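For context, here is a minimal sketch of how the narrowed handler would sit in the schema-inference path above. It assumes `TextParsingException` is univocity's `com.univocity.parsers.common.TextParsingException`, that the EOF is always wrapped two causes deep (which the guard itself assumes), and that returning an empty iterator is the right way to skip the file; `tokenize` is a hypothetical stand-in, not code from the PR:

```scala
import java.io.EOFException
import com.univocity.parsers.common.TextParsingException

// Sketch: only swallow the specific parser failure caused by hitting EOF
// mid-record, instead of any RuntimeException or IOException.
csv.flatMap { lines =>
  try {
    tokenize(lines) // stands in for the UnivocityParser.tokenizeStream call above
  } catch {
    case e: TextParsingException
        if parsedOptions.ignoreCorruptFiles &&
          e.getCause.getCause.isInstanceOf[EOFException] =>
      logWarning(s"Skipped the rest of the content in the corrupted file: ${lines.getPath()}", e)
      Iterator.empty // skip the corrupt file (assumed return shape)
  }
}
```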
[GitHub] [spark] HyukjinKwon commented on pull request #42916: [MINOR][DOCS] Fix a typo in HashAggregateExec.scala
HyukjinKwon commented on PR #42916: URL: https://github.com/apache/spark/pull/42916#issuecomment-1727123248

We have our own logic to detect forked repositories' GitHub Actions runs. You would need to go to Settings in your forked repo and enable it. For now, it seems I can't find the GitHub Actions runs in your repo:

![Screenshot 2023-09-20 at 4 25 57 PM](https://github.com/apache/spark/assets/6477701/b552aae8-9445-4225-ba74-cc164332782c)
[GitHub] [spark] zhengruifeng commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
zhengruifeng commented on PR #43013: URL: https://github.com/apache/spark/pull/43013#issuecomment-1727400614

`pyspark.ml.connect` only supports a small subset of `pyspark.ml`
[GitHub] [spark] zhengruifeng closed pull request #43003: [SPARK-45226][PYTHON][DOCS] Refine docstring of `rand/randn`
zhengruifeng closed pull request #43003: [SPARK-45226][PYTHON][DOCS] Refine docstring of `rand/randn` URL: https://github.com/apache/spark/pull/43003
[GitHub] [spark] HyukjinKwon commented on pull request #41016: [SPARK-43341][SQL] Patch StructType.toDDL not picking up on non-nullability of nested column
HyukjinKwon commented on PR #41016: URL: https://github.com/apache/spark/pull/41016#issuecomment-1727125449

@BramBoog it has conflicts against the latest master branch. You would need to resolve the conflicts via `git fetch upstream` and `git rebase upstream/master`.
[GitHub] [spark] zhengruifeng commented on pull request #43003: [SPARK-45226][PYTHON][DOCS] Refine docstring of `rand/randn`
zhengruifeng commented on PR #43003: URL: https://github.com/apache/spark/pull/43003#issuecomment-1727125799

thanks, merged to master
[GitHub] [spark] cloud-fan commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
cloud-fan commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331581159

## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala:

```diff
@@ -77,6 +79,11 @@ object SQLExecution {
     }
     val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong
     executionIdToQueryExecution.put(executionId, queryExecution)
+    val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL)
+    if (originalInterruptOnCancel == null) {
+      val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL)
+      sc.setInterruptOnCancel(interruptOnCancel)
```

Review Comment:

This has some side effects: if a task fails many times, we will abort its stage and interrupt the other running tasks if this is set to true. However, if the task does file scans, then interrupting the task means we won't trigger the task completion callback that frees file streams. I think this will lead to a memory leak, as runtime execution errors are common, especially with ANSI mode.

How does the Thrift Server leverage this feature?
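The completion callback cloud-fan refers to is the per-task listener that file-based scans register to release resources. A minimal sketch of that pattern, with illustrative names rather than the exact Spark internals:

```scala
import java.io.InputStream
import org.apache.spark.TaskContext

// Sketch of the resource-cleanup pattern used by file scans: the stream is
// closed from a task completion listener. If the task thread is interrupted
// and the listener never fires, the stream leaks, which is the side effect
// described above. (Illustrative only, not the exact Spark internals.)
def openWithCleanup(open: () => InputStream): InputStream = {
  val in = open()
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener[Unit](_ => in.close())
  }
  in
}
```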
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331134534

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:

```diff
@@ -442,6 +442,10 @@ case class InSubquery(values: Seq[Expression], query: ListQuery)
 // scalastyle:on line.size.limit
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
+  def this(valueAndList: Seq[Expression]) = {
```

Review Comment:

All right, reverted in https://github.com/apache/spark/pull/42864/commits/8bf64a7f934a537ae00421c54a3b115a25c21168.
[GitHub] [spark] dzhigimont commented on pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
dzhigimont commented on PR #40420: URL: https://github.com/apache/spark/pull/40420#issuecomment-1727680820

> @dzhigimont Can we just make the CI pass for now? I can help in the follow-ups after merging this one.
>
> Seems like the mypy checks is failing for now:
>
> ```
> starting mypy annotations test...
> annotations failed mypy checks:
> python/pyspark/pandas/namespace.py:162: error: Cannot assign multiple types to name "_range" without an explicit "Type[...]" annotation  [misc]
> python/pyspark/pandas/indexes/base.py:2075: error: Unused "type: ignore" comment
> python/pyspark/pandas/indexes/base.py:2145: error: Unused "type: ignore" comment
> Found 3 errors in 2 files (checked 703 source files)
> ```
>
> To resolve them:
>
> * [ ] Remove "type: ignore" comment from python/pyspark/pandas/indexes/base.py:2075
> * [ ] Remove "type: ignore" comment from python/pyspark/pandas/indexes/base.py:2145
> * [ ] Add "# type: ignore" comment to python/pyspark/pandas/namespace.py:162

I can't understand it: when I added the "# type: ignore" comment to python/pyspark/pandas/namespace.py:162, mypy raised an unused "type: ignore" error; when I deleted it, mypy raised `python/pyspark/pandas/namespace.py:161: error: Cannot assign multiple types to name "_range" without an explicit "Type[...]" annotation [misc]`. What do you suggest? How can I resolve the problem?
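This mypy error usually means a module-level name gets bound to two different class objects, and the fix mypy hints at is an explicit `Type[...]` annotation rather than an ignore comment. A minimal reproduction and fix with hypothetical names, not the actual `namespace.py` code:

```python
import sys
from typing import Type, Union

class OldRange: ...
class NewRange: ...

# Without the annotation below, mypy rejects rebinding a name to a second
# class object:
#   error: Cannot assign multiple types to name "_range" without an
#   explicit "Type[...]" annotation  [misc]
_range: Type[Union[OldRange, NewRange]] = OldRange
if sys.version_info >= (3, 9):  # stand-in for whatever condition namespace.py uses
    _range = NewRange
```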
[GitHub] [spark] LuciferYang opened a new pull request, #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`
LuciferYang opened a new pull request, #43015: URL: https://github.com/apache/spark/pull/43015

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
[GitHub] [spark] zhengruifeng commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
zhengruifeng commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331147608

## sql/core/src/main/scala/org/apache/spark/sql/functions.scala:

```diff
@@ -414,12 +407,13 @@ object functions {
    * @group agg_funcs
    * @since 1.3.0
    */
-  def count(e: Column): Column = withAggregateFunction {
-    e.expr match {
+  def count(e: Column): Column = {
+    val withoutStar = e.expr match {
       // Turn count(*) into count(1)
```

Review Comment:

No, Spark Connect doesn't hit such an issue.
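For context, the star-to-literal rewrite under discussion looks roughly like this; a sketch continuing the logic beyond where the diff is truncated, not the exact code from the PR:

```scala
import org.apache.spark.sql.catalyst.analysis.Star
import org.apache.spark.sql.catalyst.expressions.Literal

// Sketch of the rewrite the truncated diff performs: count(*) has no
// well-defined input column, so it is canonicalized to count(1) before
// the aggregate function is built.
val withoutStar = e.expr match {
  case _: Star => Literal(1) // turn count(*) into count(1)
  case expr => expr
}
```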
[GitHub] [spark] dzhigimont commented on a diff in pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
dzhigimont commented on code in PR #40420: URL: https://github.com/apache/spark/pull/40420#discussion_r1331412813

## python/pyspark/pandas/indexes/datetimes.py:

```diff
@@ -214,28 +215,8 @@ def microsecond(self) -> Index:
         )
         return Index(self.to_series().dt.microsecond)

-    @property
-    def week(self) -> Index:
-        """
-        The week ordinal of the year.
-
-        .. deprecated:: 3.5.0
-        """
-        warnings.warn(
-            "`week` is deprecated in 3.5.0 and will be removed in 4.0.0.",
-            FutureWarning,
-        )
-        return Index(self.to_series().dt.week)
-
-    @property
-    def weekofyear(self) -> Index:
-        warnings.warn(
-            "`weekofyear` is deprecated in 3.5.0 and will be removed in 4.0.0.",
-            FutureWarning,
-        )
-        return Index(self.to_series().dt.weekofyear)
-
-    weekofyear.__doc__ = week.__doc__
+    def isocalendar(self) -> DataFrame:
```

Review Comment:

Added
[GitHub] [spark] panbingkun commented on pull request #42993: [SPARK-45231][INFRA] Remove unrecognized and meaningless command about amm from the GA testing workflow.
panbingkun commented on PR #42993: URL: https://github.com/apache/spark/pull/42993#issuecomment-1727147543

cc @vicennial @dongjoon-hyun
[GitHub] [spark] heyihong commented on a diff in pull request #42377: [SPARK-44622][SQL][CONNECT] Implement FetchErrorDetails RPC
heyihong commented on code in PR #42377: URL: https://github.com/apache/spark/pull/42377#discussion_r1323060159

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/streaming/ClientStreamingQuerySuite.scala:

```diff
@@ -175,6 +175,36 @@ class ClientStreamingQuerySuite extends QueryTest with SQLHelper with Logging {
     }
   }

+  test("throw exception in streaming") {
+    val session = spark
+    import session.implicits._
+
+    val checkForTwo = udf((value: Int) => {
+      if (value == 2) {
+        throw new RuntimeException("Number 2 encountered!")
+      }
+      value
+    })
+
+    val query = spark.readStream
+      .format("rate")
+      .option("rowsPerSecond", "1")
+      .load()
+      .select(checkForTwo($"value").as("checkedValue"))
+      .writeStream
+      .outputMode("append")
+      .format("console")
+      .start()
+
+    val exception = intercept[SparkException] {
+      query.awaitTermination()
+    }
+
+    assert(
+      exception.getCause.getCause.getCause.getMessage
```

Review Comment:

The exception chain, from outermost to root cause:

1. org.apache.spark.SparkException: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 0 in stage 1384542.0 failed 4 times, most recent failure: Lost task 0.3 in stage ...
2. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1384542.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1384542.0 ...
3. org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (`$Lambda$19844/1935914328`: (int) => int).
4. org.apache.spark.SparkException: java.lang.RuntimeException: Number 2 encountered!
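The four layers above are nested `getCause` links. A small helper makes assertions against such chains less brittle than hard-coding `getCause.getCause.getCause`; a sketch, not part of the PR:

```scala
import scala.annotation.tailrec

// Walks a Throwable's cause chain to its deepest cause. Useful when a test
// only cares about the root error ("Number 2 encountered!") rather than the
// exact nesting depth, which can change across server versions.
@tailrec
def rootCause(t: Throwable): Throwable =
  if (t.getCause == null || (t.getCause eq t)) t else rootCause(t.getCause)
```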
[GitHub] [spark] beliefer commented on a diff in pull request #42612: [SPARK-44913][SQL] DS V2 supports push down V2 UDF that has magic method
beliefer commented on code in PR #42612: URL: https://github.com/apache/spark/pull/42612#discussion_r1331177757

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala:

```diff
@@ -279,7 +283,9 @@ case class StaticInvoke(
     inputTypes: Seq[AbstractDataType] = Nil,
     propagateNull: Boolean = true,
     returnNullable: Boolean = true,
-    isDeterministic: Boolean = true) extends InvokeLike {
+    isDeterministic: Boolean = true,
+    scalarFunctionName: Option[String] = None,
```

Review Comment:

@cloud-fan @sunchao Although adding `ScalarFunction` as a new parameter is convenient, it makes `StaticInvoke` look very strange. We already call `scalarFunc.getClass`, `scalarFunc.resultType()`, `scalarFunc.isResultNullable` and `scalarFunc.isDeterministic` before passing the new `ScalarFunction`.
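The redundancy beliefer describes: by the time `StaticInvoke` is built, every property has already been read off the `ScalarFunction`. A rough sketch of such a call site, modeled loosely on how DS V2 builds magic-method invocations; names and argument shape are illustrative, not the exact Spark code:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction

// Illustrative only: the four ScalarFunction properties are extracted up
// front, so additionally threading the ScalarFunction (or its name) through
// StaticInvoke duplicates information the caller already has.
def magicInvoke(scalarFunc: ScalarFunction[_], args: Seq[Expression]): Expression =
  StaticInvoke(
    scalarFunc.getClass,
    scalarFunc.resultType(),
    ScalarFunction.MAGIC_METHOD_NAME,
    args,
    returnNullable = scalarFunc.isResultNullable,
    isDeterministic = scalarFunc.isDeterministic)
```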
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331301998

## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingSymmetricHashJoinHelperSuite.scala:

```diff
@@ -49,7 +44,12 @@ class StreamingSymmetricHashJoinHelperSuite extends StreamTest {
   test("only literals") {
     // Literal-only conjuncts end up on the left side because that's the first bucket they fit in.
     // There's no semantic reason they couldn't be in any bucket.
-    val predicate = (lit(1) < lit(5) && lit(6) < lit(7) && lit(0) === lit(-1)).expr
+    val predicate =
+      And(
+        And(
+          LessThan(lit(1).expr, lit(5).expr),
+          LessThan(lit(6).expr, lit(7).expr)),
+        EqualTo(lit(0).expr, lit(-1).expr))
```

Review Comment:

fixed in https://github.com/apache/spark/pull/42864/commits/580f97b0e458a87509ccf6882075cf7330062d54
[GitHub] [spark] dzhigimont commented on pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
dzhigimont commented on PR #40420: URL: https://github.com/apache/spark/pull/40420#issuecomment-1727427143

> @dzhigimont Can we just make the CI pass for now? I can help in the follow-ups after merging this one.
>
> Seems like the mypy checks is failing for now:
>
> ```
> starting mypy annotations test...
> annotations failed mypy checks:
> python/pyspark/pandas/namespace.py:162: error: Cannot assign multiple types to name "_range" without an explicit "Type[...]" annotation  [misc]
> python/pyspark/pandas/indexes/base.py:2075: error: Unused "type: ignore" comment
> python/pyspark/pandas/indexes/base.py:2145: error: Unused "type: ignore" comment
> Found 3 errors in 2 files (checked 703 source files)
> ```
>
> To resolve them:
>
> * [ ] Remove "type: ignore" comment from python/pyspark/pandas/indexes/base.py:2075
> * [ ] Remove "type: ignore" comment from python/pyspark/pandas/indexes/base.py:2145
> * [ ] Add "# type: ignore" comment to python/pyspark/pandas/namespace.py:162

Fixed
[GitHub] [spark] WeichenXu123 commented on pull request #42382: [ML] Remove usage of RDD APIs for load/save in spark-ml
WeichenXu123 commented on PR #42382: URL: https://github.com/apache/spark/pull/42382#issuecomment-1727681685

@zhengruifeng Can we make the interface `saveMetadata` support both a `sparkContext` and a `sparkSession` argument? And in the Spark repo, we always pass `sparkSession` as the argument.
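A hedged sketch of the overloading suggested here, with hypothetical signatures; the real `saveMetadata` helpers in spark-ml take more parameters:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical shape: keep a SparkContext overload for source compatibility,
// make the SparkSession overload the real implementation, and always pass a
// SparkSession inside the Spark repo itself.
def saveMetadata(path: String, spark: SparkSession): Unit = {
  // write metadata through the session, e.g. spark.createDataFrame(...).write
}

def saveMetadata(path: String, sc: SparkContext): Unit = {
  // getOrCreate() reuses the session backed by the existing SparkContext
  saveMetadata(path, SparkSession.builder().getOrCreate())
}
```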
[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331145071

## sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingSymmetricHashJoinHelperSuite.scala:

```diff
@@ -49,7 +44,12 @@ class StreamingSymmetricHashJoinHelperSuite extends StreamTest {
   test("only literals") {
     // Literal-only conjuncts end up on the left side because that's the first bucket they fit in.
     // There's no semantic reason they couldn't be in any bucket.
-    val predicate = (lit(1) < lit(5) && lit(6) < lit(7) && lit(0) === lit(-1)).expr
+    val predicate =
+      And(
+        And(
+          LessThan(lit(1).expr, lit(5).expr),
+          LessThan(lit(6).expr, lit(7).expr)),
+        EqualTo(lit(0).expr, lit(-1).expr))
```

Review Comment:

Shall we use the DSL to build expressions?
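The DSL referred to here is presumably the Catalyst expression DSL, which would collapse the explicit `And`/`LessThan`/`EqualTo` tree above into operators. A sketch, assuming the test suite can import the DSL:

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.Literal

// Same predicate as the explicit And/LessThan/EqualTo tree above, built with
// Catalyst's expression DSL. && associates left, matching And(And(..), ..).
val predicate =
  (Literal(1) < Literal(5)) && (Literal(6) < Literal(7)) && (Literal(0) === Literal(-1))
```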
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331297747

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/streaming/ClientStreamingQuerySuite.scala:

```diff
@@ -175,6 +175,37 @@ class ClientStreamingQuerySuite extends QueryTest with SQLHelper with Logging {
     }
   }

+  test("throw exception in streaming") {
+    val session = spark
+    import session.implicits._
+
+    val checkForTwo = udf((value: Int) => {
+      if (value == 2) {
+        throw new RuntimeException("Number 2 encountered!")
+      }
+      value
+    })
+
+    val query = spark.readStream
+      .format("rate")
+      .option("rowsPerSecond", "1")
+      .load()
+      .select(checkForTwo($"value").as("checkedValue"))
+      .writeStream
+      .outputMode("append")
+      .format("console")
+      .start()
+
+    val exception = intercept[SparkException] {
+      query.awaitTermination()
+    }
+
+    assert(
+      exception.getCause.getCause.getCause.getMessage
```

Review Comment:

https://github.com/apache/spark/pull/42377#discussion_r1323060159. I will add the checks.
[GitHub] [spark] yaooqinn commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
yaooqinn commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331623060

## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala:

```diff
@@ -77,6 +79,11 @@ object SQLExecution {
     }
     val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong
     executionIdToQueryExecution.put(executionId, queryExecution)
+    val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL)
+    if (originalInterruptOnCancel == null) {
+      val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL)
+      sc.setInterruptOnCancel(interruptOnCancel)
```

Review Comment:

I don't quite follow: does this PR just leverage a feature that SparkContext already has? Is https://issues.apache.org/jira/browse/HDFS-1208 related?
[GitHub] [spark] amaliujia opened a new pull request, #43010: [WIP]
amaliujia opened a new pull request, #43010: URL: https://github.com/apache/spark/pull/43010

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331143291

## sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:

```diff
@@ -723,11 +728,14 @@ object IntegratedUDFTestUtils extends SQLHelper {
     override def builder(e: Seq[Expression]): Expression = {
       assert(e.length == 1, "Defined UDF only has one column")
       val expr = e.head
-      assert(expr.resolved, "column should be resolved to use the same type " +
```

Review Comment:

I'm a bit confused about this change. IIUC we always call `builder` with resolved expressions.
[GitHub] [spark] HyukjinKwon opened a new pull request, #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
HyukjinKwon opened a new pull request, #43013: URL: https://github.com/apache/spark/pull/43013

### What changes were proposed in this pull request?

This PR proposes to add a couple of notes about which modules are supported by Spark Connect.

### Why are the changes needed?

In order for users to explicitly know which ones are supported in PySpark with Spark Connect.

### Does this PR introduce _any_ user-facing change?

Yes, this exposes some notes for the PySpark API with Spark Connect.

### How was this patch tested?

Manually built the site, and checked.

![Screenshot 2023-09-20 at 6 05 54 PM](https://github.com/apache/spark/assets/6477701/64355af1-f8b2-46bf-8b2a-3ea519995272)
![Screenshot 2023-09-20 at 6 05 28 PM](https://github.com/apache/spark/assets/6477701/c6f2af70-ec6a-4644-a848-30eedd4f56cf)
![Screenshot 2023-09-20 at 6 05 32 PM](https://github.com/apache/spark/assets/6477701/b701f5c0-477d-4f7c-9c02-96c307bf14e2)

### Was this patch authored or co-authored using generative AI tooling?

No.
[GitHub] [spark] LuciferYang opened a new pull request, #43008: [SPARK-44113][BUILD] Drop Scala 2.12 Support
LuciferYang opened a new pull request, #43008: URL: https://github.com/apache/spark/pull/43008

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
[GitHub] [spark] HyukjinKwon commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
HyukjinKwon commented on PR #43013: URL: https://github.com/apache/spark/pull/43013#issuecomment-1727296619

Build: https://github.com/HyukjinKwon/spark/actions/runs/6245820107/job/16955248336
[GitHub] [spark] HyukjinKwon commented on pull request #43002: [SPARK-43498][PS][TESTS] Enable `StatsTests.test_axis_on_dataframe` for pandas 2.0.0.
HyukjinKwon commented on PR #43002: URL: https://github.com/apache/spark/pull/43002#issuecomment-1727167048

Merged to master.
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331311772

## sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala:

```diff
@@ -723,11 +728,14 @@ object IntegratedUDFTestUtils extends SQLHelper {
     override def builder(e: Seq[Expression]): Expression = {
       assert(e.length == 1, "Defined UDF only has one column")
       val expr = e.head
-      assert(expr.resolved, "column should be resolved to use the same type " +
```

Review Comment:

In https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonUDFSuite.scala#L41 the `+` is unresolved now.
[GitHub] [spark] dongjoon-hyun commented on pull request #43007: [SPARK-45229][CORE][UI] Show the number of drivers waiting in SUBMITTED status in MasterPage
dongjoon-hyun commented on PR #43007: URL: https://github.com/apache/spark/pull/43007#issuecomment-1727060848

Thank you for revising the PR title. Since the `core` module test passed and I verified manually, let me merge this.
[GitHub] [spark] HyukjinKwon closed pull request #43002: [SPARK-43498][PS][TESTS] Enable `StatsTests.test_axis_on_dataframe` for pandas 2.0.0.
HyukjinKwon closed pull request #43002: [SPARK-43498][PS][TESTS] Enable `StatsTests.test_axis_on_dataframe` for pandas 2.0.0. URL: https://github.com/apache/spark/pull/43002
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43011: [SPARK-45232][DOC] Add missing function groups to SQL references
zhengruifeng commented on code in PR #43011: URL: https://github.com/apache/spark/pull/43011#discussion_r1331200011

## sql/gen-sql-functions-docs.py:

```diff
@@ -34,6 +34,8 @@
     "math_funcs", "conditional_funcs", "generator_funcs",
     "predicate_funcs", "string_funcs", "misc_funcs",
     "bitwise_funcs", "conversion_funcs", "csv_funcs",
+    "xml_funcs", "lambda_funcs", "collection_funcs",
+    "url_funcs", "hash_funcs",
```

Review Comment:

Manually checked with

```
ag --scala 'group = \"' sql
```

All groups but `table_funcs` should be here.
[GitHub] [spark] LuciferYang commented on pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`
LuciferYang commented on PR #43015: URL: https://github.com/apache/spark/pull/43015#issuecomment-1727645034

cc @dongjoon-hyun FYI. I think this one needs to be backported to branch-3.4 and branch-3.5.
[GitHub] [spark] hvanhovell commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
hvanhovell commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331644650

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:

```diff
@@ -2882,8 +2882,7 @@ object SQLConf {
         "level settings.")
       .version("3.0.0")
       .booleanConf
-      // show full stacktrace in tests but hide in production by default.
-      .createWithDefault(Utils.isTesting)
```

Review Comment:

Can you rebase and show me it is not an issue anymore?
[GitHub] [spark] LuciferYang commented on a diff in pull request #43008: [WIP][SPARK-44113][BUILD][INFRA][DOCS] Drop support for Scala 2.12
LuciferYang commented on code in PR #43008: URL: https://github.com/apache/spark/pull/43008#discussion_r1331207849

## dev/change-scala-version.sh:

```diff
@@ -19,7 +19,7 @@

 set -e

-VALID_VERSIONS=( 2.12 2.13 )
+VALID_VERSIONS=( 2.13 )
```

Review Comment:

No further modifications were made to this file; it just makes Scala 2.12 an invalid option. We can refactor this script in the future when we need to support multiple Scala versions again.
[GitHub] [spark] zhengruifeng commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
zhengruifeng commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331149367

## sql/core/src/main/scala/org/apache/spark/sql/functions.scala:

```diff
@@ -414,12 +407,13 @@ object functions {
    * @group agg_funcs
    * @since 1.3.0
    */
-  def count(e: Column): Column = withAggregateFunction {
-    e.expr match {
+  def count(e: Column): Column = {
+    val withoutStar = e.expr match {
       // Turn count(*) into count(1)
```

Review Comment:

https://github.com/apache/spark/blob/89041a4a8c7b7787fa10f090d4324f20447c4dd3/python/pyspark/sql/connect/functions.py#L1014-L1015

https://github.com/apache/spark/blob/89041a4a8c7b7787fa10f090d4324f20447c4dd3/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala#L391
[GitHub] [spark] zhengruifeng opened a new pull request, #43011: [SPARK-45232][DOC] Add missing function groups to SQL references
zhengruifeng opened a new pull request, #43011: URL: https://github.com/apache/spark/pull/43011

### What changes were proposed in this pull request?

Add missing function groups to the SQL references:

- xml_funcs
- lambda_funcs
- collection_funcs
- url_funcs
- hash_funcs

Note that this PR doesn't fix `table_funcs`:

1. `gen-sql-functions-docs.py` doesn't work properly with `TableFunctionRegistry`; I took a cursory look but failed to fix it;
2. table functions except `range` (e.g. `explode`) are already contained in `Generator Functions`, so I am not sure we need to show them twice.

### Why are the changes needed?

When referring to the SQL references, I find that many functions are missing from https://spark.apache.org/docs/latest/sql-ref-functions.html.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Manual check.

### Was this patch authored or co-authored using generative AI tooling?

No.
[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331079422

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala:

```diff
@@ -442,6 +442,10 @@ case class InSubquery(values: Seq[Expression], query: ListQuery)
 // scalastyle:on line.size.limit
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
+  def this(valueAndList: Seq[Expression]) = {
```

Review Comment:

I'm not sure if this is useful. `in(a, b, c)` looks pretty weird to me. Can we revert it and treat `def in` as a special case? The SQL parser also treats `IN` as a special case and has dedicated syntax for it.
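For reference, the special-casing suggested here already exists at the API layer: the variadic handling lives on `Column`, so the `In` expression can keep its natural `(value, list)` constructor. A sketch roughly paraphrasing how `Column.isin` is implemented today, not a new design:

```scala
// Sketch: the variadic surface lives on Column, while the In expression
// keeps its (value, list) shape, so no Seq-based constructor is needed.
def isin(list: Any*): Column = withExpr { In(expr, list.map(lit(_).expr)) }
```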
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331243346

## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcExceptionConverter.scala:

```diff
@@ -93,33 +179,65 @@ private[client] object GrpcExceptionConverter extends JsonUtils {
       new SparkArrayIndexOutOfBoundsException(message)),
     errorConstructor[DateTimeException]((message, _) => new SparkDateTimeException(message)),
     errorConstructor((message, cause) => new SparkRuntimeException(message, cause)),
-    errorConstructor((message, cause) => new SparkUpgradeException(message, cause)))
-
-  private def errorInfoToThrowable(info: ErrorInfo, message: String): Option[Throwable] = {
-    val classes =
-      mapper.readValue(info.getMetadataOrDefault("classes", "[]"), classOf[Array[String]])
+    errorConstructor((message, cause) => new SparkUpgradeException(message, cause)),
+    errorConstructor((message, cause) => new SparkException(message, cause.orNull)))
+
+  /**
+   * errorsToThrowable reconstructs the exception based on a list of protobuf messages
+   * FetchErrorDetailsResponse.Error with un-truncated error messages and server-side stacktrace
+   * (if set).
+   */
+  private def errorsToThrowable(
+      errorIdx: Int,
+      errors: Seq[FetchErrorDetailsResponse.Error]): Throwable = {
+
+    val error = errors(errorIdx)
+
+    val classHierarchy = error.getErrorTypeHierarchyList.asScala
+
+    val constructor =
+      classHierarchy
+        .flatMap(errorFactory.get)
+        .headOption
+        .getOrElse((message: String, cause: Option[Throwable]) =>
+          new SparkException(
```

Review Comment:

The message pattern is from https://github.com/apache/spark/pull/42377#discussion_r1329224205. But this one seems to be simpler.
[GitHub] [spark] cloud-fan closed pull request #42971: [SPARK-43979][SQL][FOLLOWUP] Handle non alias-only project case
cloud-fan closed pull request #42971: [SPARK-43979][SQL][FOLLOWUP] Handle non alias-only project case URL: https://github.com/apache/spark/pull/42971
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43014: [SPARK-45235][CONNECT][PYTHON] Support map and array parameters by `sql()`
zhengruifeng commented on code in PR #43014: URL: https://github.com/apache/spark/pull/43014#discussion_r1331636427

## python/pyspark/sql/connect/plan.py:

```diff
@@ -1049,21 +1049,23 @@ def __init__(self, query: str, args: Optional[Union[Dict[str, Any], List]] = Non
         self._query = query
         self._args = args

+    def __to_expr(self, session: "SparkConnectClient", v: Any) -> proto.Expression:
```

Review Comment:

I think we'd better rename it `_to_expr`. [The Python standard](https://peps.python.org/pep-0008/#method-names-and-instance-variables) says:

> Note: there is some controversy about the use of `__names`
>
> In addition, the following special forms using leading or trailing underscores are recognized (these can generally be combined with any case convention):
>
> `_single_leading_underscore`: weak "internal use" indicator. E.g. `from M import *` does not import objects whose names start with an underscore.
>
> `single_trailing_underscore_`: used by convention to avoid conflicts with Python keyword, e.g. `tkinter.Toplevel(master, class_='ClassName')`
>
> `__double_leading_underscore`: when naming a class attribute, invokes name mangling (inside class FooBar, `__boo` becomes `_FooBar__boo`; see below).
>
> `__double_leading_and_trailing_underscore__`: "magic" objects or attributes that live in user-controlled namespaces. E.g. `__init__`, `__import__` or `__file__`. Never invent such names; only use them as documented.

## python/pyspark/sql/tests/connect/test_connect_basic.py:

```diff
@@ -1237,13 +1237,23 @@ def test_sql(self):
         self.assertEqual(1, len(pdf.index))

     def test_sql_with_named_args(self):
-        df = self.connect.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7})
-        df2 = self.spark.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7})
+        from pyspark.sql.functions import create_map, lit
+        from pyspark.sql.connect.functions import lit as clit
+        from pyspark.sql.connect.functions import create_map as ccreate_map
```

Review Comment:

https://github.com/apache/spark/blob/8c27de68756d4b0e5940211340a0b323d808aead/python/pyspark/sql/tests/connect/test_connect_basic.py#L78-L79

I think you can use the already imported `SF` and `CF`, to be consistent with other tests.
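A minimal illustration of the PEP 8 distinction quoted above, using a toy class unrelated to the actual `plan.py` code:

```python
class Plan:
    def _to_expr(self, v):   # single leading underscore: "internal use" by convention only
        return v

    def __to_expr(self, v):  # double leading underscore: name-mangled to _Plan__to_expr
        return v


p = Plan()
p._to_expr(1)        # works; the underscore is purely a convention
p._Plan__to_expr(1)  # the mangled name the interpreter actually stores
# p.__to_expr(1)     # AttributeError: mangling hides the unmangled name
```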
[GitHub] [spark] bjornjorgensen commented on a diff in pull request #43005: [WIP][SPARK-44112][BUILD][INFRA] Drop Java 8 and 11 support
bjornjorgensen commented on code in PR #43005: URL: https://github.com/apache/spark/pull/43005#discussion_r1331229836

## .github/workflows/build_coverage.yml:

```diff
@@ -17,7 +17,7 @@
 # under the License.
 #

-name: "Build / Coverage (master, Scala 2.12, Hadoop 3, JDK 8)"
+name: "Build / Coverage (master, Scala 2.12, Hadoop 17, JDK 8)"
```

Review Comment:

No: don't change Hadoop from 3 to 17, change JDK 8 to 17.
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43011: [WIP][SPARK-45232][DOC] Add missing function groups to SQL references
zhengruifeng commented on code in PR #43011: URL: https://github.com/apache/spark/pull/43011#discussion_r1331213799

## sql/gen-sql-functions-docs.py:

```diff
@@ -34,6 +34,8 @@
     "math_funcs", "conditional_funcs", "generator_funcs",
     "predicate_funcs", "string_funcs", "misc_funcs",
     "bitwise_funcs", "conversion_funcs", "csv_funcs",
+    "xml_funcs", "lambda_funcs", "collection_funcs",
+    "url_funcs", "hash_funcs", "struct_funcs",
```

Review Comment:

Checked against https://github.com/apache/spark/blob/37ab190dc5bfa59b4e06af9551c35ab179a05733/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java#L43-L48. Two differences:

1. `table_funcs`: not supported by `gen-sql-functions-docs.py`;
2. `binary_funcs`: I can not find any function using this group.
[GitHub] [spark] dzhigimont commented on a diff in pull request #40420: [SPARK-42617][PS] Support `isocalendar` from the pandas 2.0.0
dzhigimont commented on code in PR #40420: URL: https://github.com/apache/spark/pull/40420#discussion_r1331413958

## python/pyspark/pandas/datetimes.py:

```diff
@@ -116,26 +117,59 @@ def pandas_microsecond(s) -> ps.Series[np.int32]:  # type: ignore[no-untyped-def
     def nanosecond(self) -> "ps.Series":
         raise NotImplementedError()

-    # TODO(SPARK-42617): Support isocalendar.week and replace it.
-    # See also https://github.com/pandas-dev/pandas/pull/33595.
-    @property
-    def week(self) -> "ps.Series":
+    def isocalendar(self) -> "ps.DataFrame":
         """
-        The week ordinal of the year.
+        Calculate year, week, and day according to the ISO 8601 standard.

-        .. deprecated:: 3.4.0
-        """
-        warnings.warn(
-            "weekofyear and week have been deprecated.",
-            FutureWarning,
-        )
-        return self._data.spark.transform(lambda c: F.weekofyear(c).cast(LongType()))
+        .. versionadded:: 4.0.0

-    @property
-    def weekofyear(self) -> "ps.Series":
-        return self.week
+        Returns
+        -------
+        DataFrame
+            With columns year, week and day.

-    weekofyear.__doc__ = week.__doc__
+            .. note:: Returns have int64 type instead of UInt32 as is in pandas due to UInt32
+                is not supported by spark
+
+        Examples
+        --------
```

Review Comment:

If you give me 1-2 days, I will try to resolve the problem.
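For reference, this is the pandas behavior being ported, including the dtype caveat called out in the docstring. The example below uses plain pandas; the pandas-on-Spark version would return int64 columns instead of pandas' UInt32, since Spark has no unsigned 32-bit integer type:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2023-01-01", "2023-09-20"]))
cal = s.dt.isocalendar()
print(cal)
#    year  week  day
# 0  2022    52    7
# 1  2023    38    3
print(cal.dtypes)  # UInt32 in pandas; int64 in the pandas-on-Spark port
```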
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The size of the ErrorInfo exceeds the header limit sometimes in tests. So disable this by default. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331243346 ## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcExceptionConverter.scala: ## @@ -93,33 +179,65 @@ private[client] object GrpcExceptionConverter extends JsonUtils { new SparkArrayIndexOutOfBoundsException(message)), errorConstructor[DateTimeException]((message, _) => new SparkDateTimeException(message)), errorConstructor((message, cause) => new SparkRuntimeException(message, cause)), -errorConstructor((message, cause) => new SparkUpgradeException(message, cause))) - - private def errorInfoToThrowable(info: ErrorInfo, message: String): Option[Throwable] = { -val classes = - mapper.readValue(info.getMetadataOrDefault("classes", "[]"), classOf[Array[String]]) +errorConstructor((message, cause) => new SparkUpgradeException(message, cause)), +errorConstructor((message, cause) => new SparkException(message, cause.orNull))) + + /** + * errorsToThrowable reconstructs the exception based on a list of protobuf messages + * FetchErrorDetailsResponse.Error with un-truncated error messages and server-side stacktrace + * (if set). + */ + private def errorsToThrowable( + errorIdx: Int, + errors: Seq[FetchErrorDetailsResponse.Error]): Throwable = { + +val error = errors(errorIdx) + +val classHierarchy = error.getErrorTypeHierarchyList.asScala + +val constructor = + classHierarchy +.flatMap(errorFactory.get) +.headOption +.getOrElse((message: String, cause: Option[Throwable]) => + new SparkException( Review Comment: Will change. The message pattern is from https://github.com/apache/spark/pull/42377#discussion_r1329224205. But this one seems to be simpler -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
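The lookup pattern discussed in the diff above (walk the server-provided error type hierarchy from most to least specific, take the first registered constructor, and fall back to a generic exception) is easy to see in isolation. A minimal sketch in Python, even though the actual client code is Scala; all names and map contents here are illustrative:
```python
from typing import Callable, Dict, List, Optional

Constructor = Callable[[str, Optional[BaseException]], BaseException]

# Hypothetical registry: server-side class name -> client-side constructor.
error_factory: Dict[str, Constructor] = {
    "java.lang.ArithmeticException": lambda msg, cause: ArithmeticError(msg),
    "org.apache.spark.SparkException": lambda msg, cause: RuntimeError(msg),
}

def reconstruct(class_hierarchy: List[str], message: str) -> BaseException:
    for cls in class_hierarchy:      # most specific class first
        ctor = error_factory.get(cls)
        if ctor is not None:         # first registered constructor wins
            return ctor(message, None)
    return RuntimeError(message)     # generic fallback (SparkException in Scala)

# An unknown subclass still maps to its nearest known parent:
err = reconstruct(
    ["com.example.MySparkError", "org.apache.spark.SparkException"], "boom")
print(type(err).__name__, err)  # RuntimeError boom
```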
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331110271 ## sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala: ## @@ -708,7 +708,7 @@ private[sql] object RelationalGroupedDataset { case expr: NamedExpression => expr case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] => UnresolvedAlias(a, Some(Column.generateAlias)) -case u: UnresolvedFunction => UnresolvedAlias(expr, None) +case e if !e.resolved => UnresolvedAlias(expr, None) Review Comment: I changed the `AggregateExpression` path to `case a: AggregateExpression => UnresolvedAlias(a, Some(Column.generateAlias))` but still kept the `case _ if !expr.resolved => UnresolvedAlias(expr, None)` because in the `plus_one(sum_udf(col("v1")))` case an unresolved `PythonUDF` is passed here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #42996: [SPARK-45224][PYTHON] Add examples w/ map and array as parameters of `sql()`
MaxGekk commented on PR #42996: URL: https://github.com/apache/spark/pull/42996#issuecomment-1727192557 Merging to master. Thank you, @HyukjinKwon and @cloud-fan for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331113924 ## python/pyspark/sql/column.py: ## @@ -712,11 +712,11 @@ def __getitem__(self, k: Any) -> "Column": >>> df = spark.createDataFrame([('abcedfg', {"key": "value"})], ["l", "d"]) >>> df.select(df.l[slice(1, 3)], df.d['key']).show() -+--+--+ -|substring(l, 1, 3)|d[key]| -+--+--+ -| abc| value| -+--+--+ ++---+--+ +|substr(l, 1, 3)|d[key]| Review Comment: I tried `self.substring` (see https://github.com/apache/spark/pull/42864/commits/d611fd5a8da8f0f24d54c3933b3cf7e4dbdabe2c) but there is no `self.substring` so if column name doesn't matter then let's use `self.substr`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hdaikoku commented on pull request #42426: [SPARK-44756][CORE] Executor hangs when RetryingBlockTransferor fails to initiate retry
hdaikoku commented on PR #42426: URL: https://github.com/apache/spark/pull/42426#issuecomment-1727835491
> I think `SparkUncaughtExceptionHandler` should caught this OOM exception and abort the executor.

I'm not sure if I'm following this. For this particular case, OOM was actually caught at [`AbstractChannelHandlerContext.invokeExceptionCaught`](https://github.com/netty/netty/blob/netty-4.1.74.Final/transport/src/main/java/io/netty/channel/AbstractChannelHandlerContext.java#L304). So it was not an uncaught exception.
Also, if I understand correctly from the previous discussion in https://github.com/apache/spark/pull/37779, the problem of `submit()` is that it swallows uncaught exceptions that have been thrown from _inside_ a spawned thread, while, here, the OOM was thrown within `submit()` itself even before a thread was spawned.
> ```
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:719)
> at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
> at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
> ```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
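The failure mode described above can be demonstrated in miniature. A small Python analogy (the real case is Java's `ThreadPoolExecutor` throwing `OutOfMemoryError` from `Thread.start()`): `submit()` can raise on the caller's thread before any worker runs, so no thread inside the pool, and no uncaught-exception handler, ever sees it. The shutdown below is just a stand-in for resource exhaustion at submit time:
```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
executor.shutdown()  # stand-in for "cannot spawn a thread" at submit time

try:
    executor.submit(print, "initiating retry")
except RuntimeError as e:
    # Without a catch like this around the retry submission, the caller would
    # never learn that the retry was never scheduled, and could hang forever.
    print("Failed to initiate retry:", e)
```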
[GitHub] [spark] peter-toth commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331113924 ## python/pyspark/sql/column.py: ## @@ -712,11 +712,11 @@ def __getitem__(self, k: Any) -> "Column": >>> df = spark.createDataFrame([('abcedfg', {"key": "value"})], ["l", "d"]) >>> df.select(df.l[slice(1, 3)], df.d['key']).show() -+--+--+ -|substring(l, 1, 3)|d[key]| -+--+--+ -| abc| value| -+--+--+ ++---+--+ +|substr(l, 1, 3)|d[key]| Review Comment: I tried `self.substring` (see https://github.com/apache/spark/pull/42864/commits/d611fd5a8da8f0f24d54c3933b3cf7e4dbdabe2c) but there is no `self.substring` so reverted it in https://github.com/apache/spark/pull/42864/commits/6bd03d11b4d345ef0d571976ab80a99002a7ca75. If column name doesn't matter then let's use `self.substr`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng opened a new pull request, #43012: [SPARK-45234][PYTHON][DOCS] Refine DocString of `regr_*` functions
zhengruifeng opened a new pull request, #43012: URL: https://github.com/apache/spark/pull/43012
### What changes were proposed in this pull request?
Refine DocString of `regr_*` functions
### Why are the changes needed?
fix the wildcard import
### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
CI
### Was this patch authored or co-authored using generative AI tooling?
No
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331693297 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The change is rebased already. I reenabled the pyspark jvm flag for tests but disabled this flag in a specific test case instead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cxzl25 commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
cxzl25 commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331640480
## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -77,6 +79,11 @@ } val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong executionIdToQueryExecution.put(executionId, queryExecution) +val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL) +if (originalInterruptOnCancel == null) { + val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL) + sc.setInterruptOnCancel(interruptOnCancel)
Review Comment: On whether interrupting tasks can cause problems: we've encountered these two issues in our production environment before.
HDFS-10468. HDFS read ends up ignoring an interrupt https://issues.apache.org/jira/browse/HDFS-10468
HDFS-10508. DFSInputStream should set thread's interrupt status after catching InterruptException from sleep https://issues.apache.org/jira/browse/HDFS-10508
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #42908: [SPARK-44872][CONNECT][FOLLOWUP] Deflake ReattachableExecuteSuite and increase retry buffer
LuciferYang commented on PR #42908: URL: https://github.com/apache/spark/pull/42908#issuecomment-1727552095 Thanks @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331251817 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: It is disabled by default in python connect tests already: https://github.com/apache/spark/blob/c89221b02bb3000f707a31322e6d40b561e527bd/python/pyspark/testing/connectutils.py#L173 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wankunde opened a new pull request, #43009: [SPARK-45230][SQL] Plan sorter for Aggregate after SMJ
wankunde opened a new pull request, #43009: URL: https://github.com/apache/spark/pull/43009
### What changes were proposed in this pull request?
This PR could be a followup of https://github.com/apache/spark/pull/42488 and https://github.com/apache/spark/pull/42557.
If there is an aggregate operator after the SMJ and the grouping expressions of the aggregate operator contain all the join keys of the streamed side, we can add a sorter on the streamed side of the SMJ, so that the aggregate can be converted to a SortAggregate, which will be faster than a HashAggregate.
For example, with table t1(a, b, c) and t2(x, y, z):
```
SELECT a, b, sum(c) FROM t1 JOIN t2 ON t1.b = t2.y GROUP BY a, b
```
The optimized plan:
```
Scan(t1)        Scan(t2)
    |               |
    |               |
Exchange 1      Exchange 2
     \             /
      \           /
       \         /
    SMJ (t1.b = t2.y)
           |
           |
       Aggregate
```
Before this PR, Spark's EnsureRequirements will add Sort(t1.b) to the left side of the SMJ.
```
Scan(t1)        Scan(t2)
    |               |
    |               |
Exchange 1      Exchange 2
    |               |
Sort(t1.b)      Sort(t2.y)
     \             /
    SMJ (t1.b = t2.y)
           |
           |
     HashAggregate
```
If we add a Sort(t1.b, t1.a) to the left side of the SMJ, the following aggregate can be converted to a SortAggregate, which will be faster.
```
Scan(t1)            Scan(t2)
    |                   |
    |                   |
Exchange 1          Exchange 2
    |                   |
Sort(t1.b, t1.a)    Sort(t2.y)
        \              /
    SMJ (t1.b = t2.y)
           |
           |
     SortAggregate
```
### Why are the changes needed?
Optimize HashAggregate
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
### Was this patch authored or co-authored using generative AI tooling?
No
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
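As a hedged repro sketch of the query shape in the description (table contents, sizes, and the local session setup are all made up for illustration), the plan can be inspected with a plain PySpark session; whether the aggregate shows up as a HashAggregate or a SortAggregate depends on the Spark version and on this PR:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # force a sort-merge join

spark.range(1000).selectExpr("id % 10 AS a", "id % 100 AS b", "id AS c") \
    .createOrReplaceTempView("t1")
spark.range(1000).selectExpr("id AS x", "id % 100 AS y", "id AS z") \
    .createOrReplaceTempView("t2")

spark.sql("""
    SELECT a, b, sum(c)
    FROM t1 JOIN t2 ON t1.b = t2.y
    GROUP BY a, b
""").explain()
# Without the change, the streamed side is sorted on t1.b only and the plan
# shows HashAggregate; sorting on (t1.b, t1.a) would satisfy a SortAggregate.
```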
[GitHub] [spark] yaooqinn opened a new pull request, #43016: [SPARK-45077][UI][FOLLOWUP] Update comment to link the forked repo yaooqinn/dagre-d3
yaooqinn opened a new pull request, #43016: URL: https://github.com/apache/spark/pull/43016
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43011: [WIP][SPARK-45232][DOC] Add missing function groups to SQL references
zhengruifeng commented on code in PR #43011: URL: https://github.com/apache/spark/pull/43011#discussion_r1331200011
## sql/gen-sql-functions-docs.py: ## @@ -34,6 +34,8 @@ "math_funcs", "conditional_funcs", "generator_funcs", "predicate_funcs", "string_funcs", "misc_funcs", "bitwise_funcs", "conversion_funcs", "csv_funcs", +"xml_funcs", "lambda_funcs", "collection_funcs", +"url_funcs", "hash_funcs",
Review Comment: Manually checked with
```
ag --scala 'group = \"' sql
```
All groups except `table_funcs` should be here.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #42864: [WIP][SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan commented on code in PR #42864: URL: https://github.com/apache/spark/pull/42864#discussion_r1331077927 ## python/pyspark/sql/column.py: ## @@ -712,11 +712,11 @@ def __getitem__(self, k: Any) -> "Column": >>> df = spark.createDataFrame([('abcedfg', {"key": "value"})], ["l", "d"]) >>> df.select(df.l[slice(1, 3)], df.d['key']).show() -+--+--+ -|substring(l, 1, 3)|d[key]| -+--+--+ -| abc| value| -+--+--+ ++---+--+ +|substr(l, 1, 3)|d[key]| Review Comment: Ping @peter-toth -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #42994: [SPARK-43433][PS] Match `GroupBy.nth` behavior to the latest Pandas
itholic commented on code in PR #42994: URL: https://github.com/apache/spark/pull/42994#discussion_r1331072003
## python/pyspark/pandas/groupby.py: ## @@ -1155,14 +1152,32 @@ def nth(self, n: int) -> FrameLike: else: sdf = sdf.select(*groupkey_names).distinct() -internal = internal.copy( +agg_columns = [] +if not self._agg_columns_selected:
Review Comment: The `groupkeys` become data columns instead of the index when they are not included in `agg_columns_selected`. And as of pandas 2.0.0, `nth` should keep the index unchanged. Therefore, we should include the `groupkeys` in `agg_columns` to make them data columns, and create a new `InternalFrame` manually (see the small pandas illustration below).
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
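A small pandas illustration (assuming pandas >= 2.0) of the behavior being matched: `nth()` keeps the original index and leaves the group keys as a regular data column instead of moving them into the index.
```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [None, 2, 3, 4, 5]})

# Last row of each group; the original row labels (3 and 4) survive,
# and "A" stays a data column rather than becoming the index.
print(df.groupby("A").nth(-1))
#    A    B
# 3  1  4.0
# 4  2  5.0
```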
[GitHub] [spark] MaxGekk opened a new pull request, #43014: [WIP][CONNECT][PYTHON] Support map and array parameters by `sql()`
MaxGekk opened a new pull request, #43014: URL: https://github.com/apache/spark/pull/43014
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The size of the ErrorInfo exceeds the header limit sometimes in tests. So disable this by default. It should not be an issue after we migrate to use FetchErrorDetails RPC -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #42996: [SPARK-45224][PYTHON] Add examples w/ map and array as parameters of `sql()`
MaxGekk closed pull request #42996: [SPARK-45224][PYTHON] Add examples w/ map and array as parameters of `sql()` URL: https://github.com/apache/spark/pull/42996 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #43011: [WIP][SPARK-45232][DOC] Add missing function groups to SQL references
zhengruifeng commented on code in PR #43011: URL: https://github.com/apache/spark/pull/43011#discussion_r1331213799
## sql/gen-sql-functions-docs.py: ## @@ -34,6 +34,8 @@ "math_funcs", "conditional_funcs", "generator_funcs", "predicate_funcs", "string_funcs", "misc_funcs", "bitwise_funcs", "conversion_funcs", "csv_funcs", +"xml_funcs", "lambda_funcs", "collection_funcs", +"url_funcs", "hash_funcs", "struct_funcs",
Review Comment: Checked against https://github.com/apache/spark/blob/37ab190dc5bfa59b4e06af9551c35ab179a05733/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java#L43-L48. Two differences: 1. `table_funcs`: not supported in `gen-sql-functions-docs.py`; 2. `binary_funcs`: we don't have any function using this group.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
cloud-fan commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331684745
## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -77,6 +79,11 @@ } val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong executionIdToQueryExecution.put(executionId, queryExecution) +val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL) +if (originalInterruptOnCancel == null) { + val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL) + sc.setInterruptOnCancel(interruptOnCancel)
Review Comment:
> does this PR just leverage a feature that SC already have?

Yes. The problem is that now we enable this feature by default, while this feature can lead to memory leaks. That's why I asked how the thriftserver leverages this feature. After this PR, all queries will use this feature, not only the thriftserver.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
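For readers unfamiliar with the SparkContext feature this thread is about: registering a job group with `interruptOnCancel=True` makes `cancelJobGroup()` also interrupt the running task threads. A minimal PySpark sketch; the group id and the workload are illustrative:
```python
import threading
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

# Jobs submitted from this thread now belong to "demo-group"; cancellation
# will send Thread.interrupt() to their task threads.
sc.setJobGroup("demo-group", "long running job", interruptOnCancel=True)

def cancel_later():
    time.sleep(2)
    sc.cancelJobGroup("demo-group")

threading.Thread(target=cancel_later).start()
try:
    sc.parallelize(range(4), 4).foreach(lambda _: time.sleep(60))
except Exception as e:
    print("job cancelled:", type(e).__name__)
finally:
    spark.stop()
```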
[GitHub] [spark] LuciferYang commented on pull request #43005: [WIP][SPARK-44112][BUILD][INFRA] Drop Java 8 and 11 support
LuciferYang commented on PR #43005: URL: https://github.com/apache/spark/pull/43005#issuecomment-1727201913 wait https://github.com/apache/spark/pull/43008 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #43007: [SPARK-45229][CORE][UI] Show the number of drivers waiting in SUBMITTED status in MasterPage
dongjoon-hyun closed pull request #43007: [SPARK-45229][CORE][UI] Show the number of drivers waiting in SUBMITTED status in MasterPage URL: https://github.com/apache/spark/pull/43007 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #43008: [SPARK-44113][BUILD][INFRA][DOCS] Drop support for Scala 2.12
LuciferYang commented on PR #43008: URL: https://github.com/apache/spark/pull/43008#issuecomment-1727210378 cc @dongjoon-hyun FYI. Do we need to further split this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #42971: [SPARK-43979][SQL][FOLLOWUP] Handle non alias-only project case
cloud-fan commented on PR #42971: URL: https://github.com/apache/spark/pull/42971#issuecomment-1727363616 The failed streaming test is unrelated, I'm merging it to master/3.5, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The size of the ErrorInfo is about to exceed the header limit even in tests. So disable this by default. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #43013: [MINOR][DOCS][CONNECT] Update notes about supported modules in PySpark API reference
zhengruifeng commented on PR #43013: URL: https://github.com/apache/spark/pull/43013#issuecomment-1727396313 `/mllib` in Scala, and `pyspark.ml` and `pyspark.mllib` in Python, don't work on Connect. Only the new module `pyspark.ml.connect` works on Connect. `pyspark.ml` contains many classification/clustering/etc. algorithms; they are not supported in `pyspark.ml.connect`. cc @WeichenXu123 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #43005: [WIP][SPARK-44112][BUILD][INFRA] Drop Java 8 and 11 support
LuciferYang commented on code in PR #43005: URL: https://github.com/apache/spark/pull/43005#discussion_r1331245458 ## .github/workflows/build_coverage.yml: ## @@ -17,7 +17,7 @@ # under the License. # -name: "Build / Coverage (master, Scala 2.12, Hadoop 3, JDK 8)" +name: "Build / Coverage (master, Scala 2.12, Hadoop 17, JDK 8)" Review Comment: good catch ~ thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs
cloud-fan commented on PR #42997: URL: https://github.com/apache/spark/pull/42997#issuecomment-1727800532 cc @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331693297 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The change is rebased already. I reenabled the pyspark jvm stacktrace flag for tests but disabled this flag in a specific test case instead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong opened a new pull request, #43017: [SPARK-45239] Reduce default spark.connect.jvmStacktrace.maxSize
heyihong opened a new pull request, #43017: URL: https://github.com/apache/spark/pull/43017
### What changes were proposed in this pull request?
- Reduce default spark.connect.jvmStacktrace.maxSize
### Why are the changes needed?
- `spark.sql.pyspark.jvmStacktrace.enabled` is partially broken (i.e. hitting the 8K header limit) and has to be disabled in some tests
### Does this PR introduce _any_ user-facing change?
- No
### How was this patch tested?
- Existing tests
### Was this patch authored or co-authored using generative AI tooling?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331243346 ## connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/GrpcExceptionConverter.scala: ## @@ -93,33 +179,65 @@ private[client] object GrpcExceptionConverter extends JsonUtils { new SparkArrayIndexOutOfBoundsException(message)), errorConstructor[DateTimeException]((message, _) => new SparkDateTimeException(message)), errorConstructor((message, cause) => new SparkRuntimeException(message, cause)), -errorConstructor((message, cause) => new SparkUpgradeException(message, cause))) - - private def errorInfoToThrowable(info: ErrorInfo, message: String): Option[Throwable] = { -val classes = - mapper.readValue(info.getMetadataOrDefault("classes", "[]"), classOf[Array[String]]) +errorConstructor((message, cause) => new SparkUpgradeException(message, cause)), +errorConstructor((message, cause) => new SparkException(message, cause.orNull))) + + /** + * errorsToThrowable reconstructs the exception based on a list of protobuf messages + * FetchErrorDetailsResponse.Error with un-truncated error messages and server-side stacktrace + * (if set). + */ + private def errorsToThrowable( + errorIdx: Int, + errors: Seq[FetchErrorDetailsResponse.Error]): Throwable = { + +val error = errors(errorIdx) + +val classHierarchy = error.getErrorTypeHierarchyList.asScala + +val constructor = + classHierarchy +.flatMap(errorFactory.get) +.headOption +.getOrElse((message: String, cause: Option[Throwable]) => + new SparkException( Review Comment: This is from https://github.com/apache/spark/pull/42377#discussion_r1329224205. But this one seems to be simpler -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] heyihong commented on a diff in pull request #42987: [SPARK-45207][SQL][CONNECT] Implement Complete Error Reconstruction for Scala Client
heyihong commented on code in PR #42987: URL: https://github.com/apache/spark/pull/42987#discussion_r1331247912 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -2882,8 +2882,7 @@ object SQLConf { "level settings.") .version("3.0.0") .booleanConf - // show full stacktrace in tests but hide in production by default. - .createWithDefault(Utils.isTesting) Review Comment: The size of the ErrorInfo exceeds the header limit sometimes in the tests I added. So disable this by default. It should not be an issue after we migrate to use FetchErrorDetails RPC -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan commented on PR #42864: URL: https://github.com/apache/spark/pull/42864#issuecomment-1727813318 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hdaikoku commented on a diff in pull request #42426: [SPARK-44756][CORE] Executor hangs when RetryingBlockTransferor fails to initiate retry
hdaikoku commented on code in PR #42426: URL: https://github.com/apache/spark/pull/42426#discussion_r1331742974 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java: ## @@ -274,7 +287,13 @@ private void handleBlockTransferFailure(String blockId, Throwable exception) { synchronized (RetryingBlockTransferor.this) { if (this == currentListener && outstandingBlocksIds.contains(blockId)) { if (shouldRetry(exception)) { -initiateRetry(exception); +try { + initiateRetry(exception); +} catch (Throwable t) { + logger.error("Exception while trying to initiate retry", t); Review Comment: Thank you for the suggestion, refactored accordingly: [32498ff](https://github.com/apache/spark/pull/42426/commits/32498ff9a906a21746b619f1ddd316777d188417) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on PR #42949: URL: https://github.com/apache/spark/pull/42949#issuecomment-1727917945 Updated fork's master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331785918
## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}")
Review Comment: I don't think we can do something like that here. We don't have a server-side error response to process; this is a purely client-side error, an exception thrown while grpc consumes the iterator of requests. It's not possible to extract any information outside of the grpc call since grpc suppresses the error entirely. I think the only option for producing proper exception classes is to preload all the artifacts into memory and construct the corresponding requests before streaming them into grpc.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
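The problem under discussion is easy to reproduce without Spark. A standalone sketch of the pattern in the diff: when a generator of requests raises, grpc surfaces only a generic "Exception iterating requests!" to the caller, so the client logs the real cause before it is lost. The names below are illustrative, not the actual client code:
```python
import logging

logger = logging.getLogger("sketch")

def parse_artifacts(path):
    raise FileNotFoundError(path)  # e.g. the user passed a bad local path
    yield                          # never reached; makes this a generator

def create_requests(path):
    try:
        yield from parse_artifacts(path)
    except Exception as e:
        # Log the real error here; a grpc stub consuming this iterator would
        # otherwise report only "Exception iterating requests!".
        logger.error(f"Failed to submit addArtifacts request: {e}")
        raise

if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    try:
        list(create_requests("/tmp/missing_artifact.py"))
    except FileNotFoundError:
        pass  # the original exception type is preserved for direct consumers
```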
[GitHub] [spark] cloud-fan closed pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
cloud-fan closed pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions URL: https://github.com/apache/spark/pull/42864 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
allisonwang-db commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331894962 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: Could you be more specific about what `Exception` can be thrown here? From your PR description, it looks like a gRPC error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn commented on a diff in pull request #42199: [SPARK-44579][SQL] Support Interrupt On Cancel in SQLExecution
yaooqinn commented on code in PR #42199: URL: https://github.com/apache/spark/pull/42199#discussion_r1331832544 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala: ## @@ -77,6 +79,11 @@ } val rootExecutionId = sc.getLocalProperty(EXECUTION_ROOT_ID_KEY).toLong executionIdToQueryExecution.put(executionId, queryExecution) +val originalInterruptOnCancel = sc.getLocalProperty(SPARK_JOB_INTERRUPT_ON_CANCEL) +if (originalInterruptOnCancel == null) { + val interruptOnCancel = sparkSession.conf.get(SQLConf.INTERRUPT_ON_CANCEL) + sc.setInterruptOnCancel(interruptOnCancel) Review Comment: STS calls cancelJobGroup after a query fails -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`
dongjoon-hyun commented on PR #43015: URL: https://github.com/apache/spark/pull/43015#issuecomment-1727988603 Merged to master/3.5/3.4. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun opened a new pull request, #43018: [SPARK-45241][INFRA] Use Zulu JDK in java-other-versions pipeline and Java 21
dongjoon-hyun opened a new pull request, #43018: URL: https://github.com/apache/spark/pull/43018 …
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun closed pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB`
dongjoon-hyun closed pull request #43015: [SPARK-45237][DOCS] Change the default value of `spark.history.store.hybridStore.diskBackend` in `monitoring.md` to `ROCKSDB` URL: https://github.com/apache/spark/pull/43015 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on pull request #42864: [SPARK-45112][SQL] Use UnresolvedFunction based resolution in SQL Dataset functions
peter-toth commented on PR #42864: URL: https://github.com/apache/spark/pull/42864#issuecomment-1728048096 Thanks for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331901152 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}") Review Comment: In my example I'm intercepting a `FileNotFoundError` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331900167 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}")
Review Comment:
> Could you be more specific on what Exception can be thrown here? From your PR description, it looks like a gRPC error.

Doesn't matter; any exception thrown here would be suppressed by grpc and only a generic "Fail to iterate response" exception will be visible.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331901403 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}")
Review Comment:
> What's the behavior after your change when the SPARK_CONNECT logger is enabled?

described in pr
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
allisonwang-db commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331899412 ## python/pyspark/sql/connect/client/artifact.py: ## @@ -243,11 +244,15 @@ def _create_requests( self, *path: str, pyfile: bool, archive: bool, file: bool ) -> Iterator[proto.AddArtifactsRequest]: """Separated for the testing purpose.""" -return self._add_artifacts( -chain( -*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +try: +yield from self._add_artifacts( +chain( +*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path) +) ) -) +except Exception as e: +logger.error(f"Failed to submit addArtifacts request: {e}")
Review Comment: I just think we shouldn't leave users with this error message when they haven't enabled the Spark Connect logger:
```
status = StatusCode.UNKNOWN
details = "Exception iterating requests!"
debug_error_string = "None"
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
allisonwang-db commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331897585

## python/pyspark/sql/connect/client/artifact.py:
## @@ -243,11 +244,15 @@ def _create_requests(
         self, *path: str, pyfile: bool, archive: bool, file: bool
     ) -> Iterator[proto.AddArtifactsRequest]:
         """Separated for the testing purpose."""
-        return self._add_artifacts(
-            chain(
-                *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+        try:
+            yield from self._add_artifacts(
+                chain(
+                    *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+                )
             )
-        )
+        except Exception as e:
+            logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
What's the behavior after your change when the SPARK_CONNECT logger is enabled?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331900167

## python/pyspark/sql/connect/client/artifact.py:
## @@ -243,11 +244,15 @@ def _create_requests(
         self, *path: str, pyfile: bool, archive: bool, file: bool
     ) -> Iterator[proto.AddArtifactsRequest]:
         """Separated for the testing purpose."""
-        return self._add_artifacts(
-            chain(
-                *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+        try:
+            yield from self._add_artifacts(
+                chain(
+                    *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+                )
             )
-        )
+        except Exception as e:
+            logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
> Could you be more specific on what Exception can be thrown here? From your PR description, it looks like a gRPC error.

It doesn't matter: any exception thrown here is suppressed by gRPC, and only the generic "Exception iterating requests" error will be visible.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
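This suppression is easy to reproduce without a server. In the sketch below, `fake_grpc_consume` is a stand-in for gRPC's client-streaming machinery (all names hypothetical); it shows why a log line inside the generator is the only place the real cause stays visible:

```python
import logging

logger = logging.getLogger("SPARK_CONNECT")  # assumed logger name for the sketch
logging.basicConfig(level=logging.ERROR)

def fake_grpc_consume(request_iterator):
    # Stand-in for gRPC's client-streaming machinery: it drains the request
    # iterator and replaces any failure with a generic, opaque error.
    try:
        for _ in request_iterator:
            pass
    except Exception:
        raise RuntimeError('details = "Exception iterating requests!"')

def create_requests(paths):
    try:
        for p in paths:
            if not p.endswith(".py"):
                raise ValueError(f"cannot parse artifact: {p}")
            yield p
    except Exception as e:
        # The log line is the only place the real cause survives.
        logger.error(f"Failed to submit addArtifacts request: {e}")
        raise

try:
    fake_grpc_consume(create_requests(["a.py", "b.jar"]))
except RuntimeError as e:
    print(e)  # the generic message is all the caller ever sees
```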
[GitHub] [spark] allisonwang-db commented on a diff in pull request #43014: [SPARK-45235][CONNECT][PYTHON] Support map and array parameters by `sql()`
allisonwang-db commented on code in PR #43014: URL: https://github.com/apache/spark/pull/43014#discussion_r1331920397

## python/pyspark/sql/tests/connect/test_connect_basic.py:
## @@ -1237,13 +1237,23 @@ def test_sql(self):
         self.assertEqual(1, len(pdf.index))

     def test_sql_with_named_args(self):
-        df = self.connect.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7})
-        df2 = self.spark.sql("SELECT * FROM range(10) WHERE id > :minId", args={"minId": 7})
+        from pyspark.sql.functions import create_map, lit
+        from pyspark.sql.connect.functions import lit as clit
+        from pyspark.sql.connect.functions import create_map as ccreate_map

Review Comment:
@zhengruifeng just wondering, should `SF` and `CF` be capitalized here?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
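The question refers to the PySpark test convention of importing the two function modules under short aliases rather than renaming individual functions. A hypothetical reshaping of the imports above under that convention (`SF`/`CF` are the aliases being discussed, not code from the PR):

```python
# Module-level aliases instead of per-function renames (lit -> clit, etc.).
from pyspark.sql import functions as SF            # classic, py4j-backed functions
from pyspark.sql.connect import functions as CF    # their Spark Connect counterparts

# With both a classic and a connect session available, the named sql()
# arguments in the test would then read symmetrically, e.g.:
#   args  = {"m": SF.create_map(SF.lit("a"), SF.lit(1))}
#   cargs = {"m": CF.create_map(CF.lit("a"), CF.lit(1))}
```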
[GitHub] [spark] allisonwang-db commented on a diff in pull request #43011: [WIP][SPARK-45232][DOC] Add missing function groups to SQL references
allisonwang-db commented on code in PR #43011: URL: https://github.com/apache/spark/pull/43011#discussion_r1331925178

## sql/gen-sql-functions-docs.py:
## @@ -34,6 +34,8 @@
     "math_funcs", "conditional_funcs", "generator_funcs",
     "predicate_funcs", "string_funcs", "misc_funcs",
     "bitwise_funcs", "conversion_funcs", "csv_funcs",
+    "xml_funcs", "lambda_funcs", "collection_funcs",
+    "url_funcs", "hash_funcs", "struct_funcs",

Review Comment:
QQ: For `generator_funcs`, do we have documentation for them when used in the FROM clause of a query? Functions like `explode` are typically considered table-valued generator functions.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
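For context on the question: a generator function such as `explode` can appear in the SELECT list or be applied to a relation via LATERAL VIEW, and recent Spark versions also accept it directly in the FROM clause as a table-valued generator. A small illustration (assumes a local Spark session; the output shape is the point, not the data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Generator function in the SELECT list:
spark.sql("SELECT explode(array(10, 20)) AS value").show()

# LATERAL VIEW form against an existing relation:
spark.sql("""
    SELECT id, x
    FROM VALUES (1, array(1, 2)) AS t(id, arr)
    LATERAL VIEW explode(arr) AS x
""").show()

# On recent Spark versions the generator can also sit directly in FROM
# as a table-valued function:
spark.sql("SELECT * FROM explode(array(10, 20))").show()
```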
[GitHub] [spark] cdkrot commented on a diff in pull request #42949: [SPARK-45093][CONNECT][PYTHON] Error reporting for addArtifacts query
cdkrot commented on code in PR #42949: URL: https://github.com/apache/spark/pull/42949#discussion_r1331929198

## python/pyspark/sql/connect/client/artifact.py:
## @@ -243,11 +244,15 @@ def _create_requests(
         self, *path: str, pyfile: bool, archive: bool, file: bool
     ) -> Iterator[proto.AddArtifactsRequest]:
         """Separated for the testing purpose."""
-        return self._add_artifacts(
-            chain(
-                *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+        try:
+            yield from self._add_artifacts(
+                chain(
+                    *(self._parse_artifacts(p, pyfile=pyfile, archive=archive, file=file) for p in path)
+                )
             )
-        )
+        except Exception as e:
+            logger.error(f"Failed to submit addArtifacts request: {e}")

Review Comment:
> I just don't think we should leave users with this error message when they don't enable the Spark Connect logger:

We can wrap the exception on the outside in some SparkException saying that AddArtifacts failed. WDYT?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
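A minimal sketch of that wrapping idea, applied at the call site where the gRPC error surfaces (the exception class name and message are placeholders, not the PR's actual choice):

```python
import grpc  # assumes grpcio is installed

class AddArtifactsError(RuntimeError):
    """Placeholder for whichever Spark exception type the PR settles on."""

def submit(stub_call, requests):
    try:
        return stub_call(requests)
    except grpc.RpcError as e:
        # Replace the opaque "Exception iterating requests!" failure with an
        # actionable message, chaining the original error as the cause.
        raise AddArtifactsError(
            "addArtifacts failed; enable the SPARK_CONNECT logger for details"
        ) from e
```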
[GitHub] [spark] allisonwang-db opened a new pull request, #43019: [SPARK-45219][PYTHON][DOCS] Refine docstring of withColumn(s)Renamed
allisonwang-db opened a new pull request, #43019: URL: https://github.com/apache/spark/pull/43019

### What changes were proposed in this pull request?
This PR refines the docstrings of `DataFrame.withColumnRenamed` and `DataFrame.withColumnsRenamed`.

### Why are the changes needed?
To improve the PySpark documentation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doctest

### Was this patch authored or co-authored using generative AI tooling?
No

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] allisonwang-db commented on a diff in pull request #42997: [SPARK-45216][SQL] Fix non-deterministic seeded Dataset APIs
allisonwang-db commented on code in PR #42997: URL: https://github.com/apache/spark/pull/42997#discussion_r1331943027

## python/pyspark/sql/connect/functions.py:
## @@ -388,7 +390,7 @@ def rand(seed: Optional[int] = None) -> Column:
     if seed is not None:
         return _invoke_function("rand", lit(seed))
     else:
-        return _invoke_function("rand")
+        return _invoke_function("rand", lit(random.randint(0, sys.maxsize)))

Review Comment:
These are Spark Connect functions. Do we also need to update `python/pyspark/sql/functions.py`?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
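The pattern under review pins a client-side seed whenever the caller omits one, so that every evaluation of the same plan draws from the same stream. A standalone sketch of just that defaulting logic (names hypothetical):

```python
import random
import sys

def resolve_seed(seed=None):
    # Fix the seed once, at expression-construction time, instead of letting
    # each (re-)execution draw its own, which is what made the unseeded
    # rand()/randn() APIs non-deterministic across retries and recomputations.
    return seed if seed is not None else random.randint(0, sys.maxsize)

s1 = resolve_seed()    # chosen once and reused for every evaluation of the plan
s2 = resolve_seed(42)  # explicit seeds pass through unchanged
print(s1, s2)
```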