[GitHub] [spark] LuciferYang commented on a diff in pull request #40510: [SPARK-42889][CONNECT][PYTHON] Implement cache, persist, unpersist, and storageLevel
LuciferYang commented on code in PR #40510: URL: https://github.com/apache/spark/pull/40510#discussion_r1144269110

## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -54,6 +54,20 @@ message UserContext { repeated google.protobuf.Any extensions = 999; } +// StorageLevel for persisting Datasets/Tables. +message StorageLevel {

Review Comment: friendly ping @ueshin @HyukjinKwon When I tried to add `import "spark/connect/base.proto";` to `catalog.proto` to reuse `StorageLevel` for `message CacheTable`, I got the following compile errors:

```
spark/connect/base.proto:23:1: File recursively imports itself: spark/connect/base.proto -> spark/connect/commands.proto -> spark/connect/relations.proto -> spark/connect/catalog.proto -> spark/connect/base.proto
spark/connect/catalog.proto:22:1: Import "spark/connect/base.proto" was not found or had errors.
spark/connect/catalog.proto:144:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import.
spark/connect/catalog.proto:161:12: "spark.connect.DataType" seems to be defined in "spark/connect/types.proto", which is not imported by "spark/connect/catalog.proto". To use it here, please add the necessary import.
spark/connect/relations.proto:25:1: Import "spark/connect/catalog.proto" was not found or had errors.
spark/connect/relations.proto:84:5: "Catalog" is not defined.
spark/connect/commands.proto:22:1: Import "spark/connect/relations.proto" was not found or had errors.
spark/connect/commands.proto:63:3: "Relation" is not defined.
spark/connect/commands.proto:81:3: "Relation" is not defined.
spark/connect/commands.proto:142:3: "Relation" is not defined.
spark/connect/base.proto:23:1: Import "spark/connect/commands.proto" was not found or had errors.
spark/connect/base.proto:25:1: Import "spark/connect/relations.proto" was not found or had errors.
```

So should we move `StorageLevel` to a more suitable place?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
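The `StorageLevel` message being discussed mirrors the flags of Spark's `org.apache.spark.storage.StorageLevel`. As a rough illustration of what such a message carries — a hypothetical Python model for this discussion, not the generated protobuf class; the snake_case field names are assumptions based on the Scala `StorageLevel` attributes:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StorageLevel:
    # Flags mirroring org.apache.spark.storage.StorageLevel.
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool
    replication: int = 1

# Two named levels whose flag combinations are unambiguous in Spark:
DISK_ONLY = StorageLevel(use_disk=True, use_memory=False,
                         use_off_heap=False, deserialized=False)
DISK_ONLY_2 = StorageLevel(use_disk=True, use_memory=False,
                           use_off_heap=False, deserialized=False,
                           replication=2)
```

Because the message is pure data with no references to `Relation` or other `base.proto` types, moving it to a leaf proto file (as the comment suggests) would break the import cycle.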
[GitHub] [spark] LuciferYang commented on a diff in pull request #40516: [SPARK-42894][CONNECT] Support `cache`/`persist`/`unpersist`/`storageLevel` for Spark connect jvm client
LuciferYang commented on code in PR #40516: URL: https://github.com/apache/spark/pull/40516#discussion_r1144250578 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2771,22 +2771,86 @@ class Dataset[T] private[sql] ( new DataFrameWriterV2[T](table, this) } + /** + * Persist this Dataset with the default storage level (`MEMORY_AND_DISK`). + * + * @group basic + * @since 3.4.0 Review Comment: Make `@since 3.4.0` first
[GitHub] [spark] LuciferYang commented on pull request #40516: [SPARK-42894][CONNECT] Support `cache`/`persist`/`unpersist`/`storageLevel` for Spark connect jvm client
LuciferYang commented on PR #40516: URL: https://github.com/apache/spark/pull/40516#issuecomment-1478944069 cc @HyukjinKwon @hvanhovell FYI
[GitHub] [spark] LuciferYang opened a new pull request, #40516: [SPARK-42894][CONNECT] Support `cache`/`persist`/`unpersist`/`storageLevel` for Scala connect client
LuciferYang opened a new pull request, #40516: URL: https://github.com/apache/spark/pull/40516

### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
[GitHub] [spark] ritikam2 commented on pull request #40116: [SPARK-41391][SQL] The output column name of groupBy.agg(count_distinct) is incorrect
ritikam2 commented on PR #40116: URL: https://github.com/apache/spark/pull/40116#issuecomment-1478932824

Perhaps the following would be a better solution: instead of looking for a star, any `UnresolvedFunction` should get an `UnresolvedAlias`. Any comments?

```scala
private[this] def alias(expr: Expression): NamedExpression = expr match {
  case expr: NamedExpression => expr
  case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
    UnresolvedAlias(a, Some(Column.generateAlias))
  case expr: Expression =>
    if (expr.isInstanceOf[UnresolvedFunction]) {
      UnresolvedAlias(expr, None)
    } else {
      Alias(expr, toPrettySQL(expr))()
    }
}
```
[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
Stove-hust commented on PR #40393: URL: https://github.com/apache/spark/pull/40393#issuecomment-1478924572 @mridulm yep, it's me. Username: StoveM Full name: Fencheng Mei
[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
Stove-hust commented on PR #40393: URL: https://github.com/apache/spark/pull/40393#issuecomment-1478924279 > yep, it's me. Username: StoveM Full name: Fencheng Mei
[GitHub] [spark] srowen commented on pull request #40504: [SPARK-42880][DOCS] Update running-on-yarn.md to log4j2 syntax
srowen commented on PR #40504: URL: https://github.com/apache/spark/pull/40504#issuecomment-1478906144 Merged to master
[GitHub] [spark] srowen closed pull request #40504: [SPARK-42880][DOCS] Update running-on-yarn.md to log4j2 syntax
srowen closed pull request #40504: [SPARK-42880][DOCS] Update running-on-yarn.md to log4j2 syntax URL: https://github.com/apache/spark/pull/40504
[GitHub] [spark] mridulm commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
mridulm commented on PR #40393: URL: https://github.com/apache/spark/pull/40393#issuecomment-1478905577 Is your jira id `StoveM` @Stove-hust ?
[GitHub] [spark] cloud-fan commented on a diff in pull request #40512: [SPARK-42892][SQL] Move sameType and relevant methods out of DataType
cloud-fan commented on code in PR #40512: URL: https://github.com/apache/spark/pull/40512#discussion_r1144213535 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/DataTypeUtils.scala: ## @@ -0,0 +1,77 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.types + +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType} + +object DataTypeUtils { + /** + * Check if `this` and `other` are the same data type when ignoring nullability + * (`StructField.nullable`, `ArrayType.containsNull`, and `MapType.valueContainsNull`). + */ + def sameType(left: DataType, right: DataType): Boolean = +if (SQLConf.get.caseSensitiveAnalysis) { + equalsIgnoreNullability(left, right) +} else { + equalsIgnoreCaseAndNullability(left, right) +} + + /** + * Compares two types, ignoring nullability of ArrayType, MapType, StructType. + */ + def equalsIgnoreNullability(left: DataType, right: DataType): Boolean = { Review Comment: nvm, it was a static method before this PR.
[GitHub] [spark] cloud-fan commented on a diff in pull request #40512: [SPARK-42892][SQL] Move sameType and relevant methods out of DataType
cloud-fan commented on code in PR #40512: URL: https://github.com/apache/spark/pull/40512#discussion_r1144213026 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/DataTypeUtils.scala: ## @@ -0,0 +1,77 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.catalyst.types + +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType} + +object DataTypeUtils { + /** + * Check if `this` and `other` are the same data type when ignoring nullability + * (`StructField.nullable`, `ArrayType.containsNull`, and `MapType.valueContainsNull`). + */ + def sameType(left: DataType, right: DataType): Boolean = +if (SQLConf.get.caseSensitiveAnalysis) { + equalsIgnoreNullability(left, right) +} else { + equalsIgnoreCaseAndNullability(left, right) +} + + /** + * Compares two types, ignoring nullability of ArrayType, MapType, StructType. + */ + def equalsIgnoreNullability(left: DataType, right: DataType): Boolean = { Review Comment: does this method require any non-public APIs?
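The body of `equalsIgnoreNullability` is elided in the diff above. Here is a toy Python sketch of the recursion such a helper performs — toy type classes, not Spark's, and the structure is an assumption based on the method's documented contract (ignore `containsNull`, `valueContainsNull`, and field nullability while comparing everything else):

```python
from dataclasses import dataclass

@dataclass
class ArrayType:
    element_type: object
    contains_null: bool

@dataclass
class MapType:
    key_type: object
    value_type: object
    value_contains_null: bool

@dataclass
class StructField:
    name: str
    data_type: object
    nullable: bool

@dataclass
class StructType:
    fields: list

def equals_ignore_nullability(left, right) -> bool:
    # Recurse structurally, comparing everything except the nullability flags.
    if isinstance(left, ArrayType) and isinstance(right, ArrayType):
        return equals_ignore_nullability(left.element_type, right.element_type)
    if isinstance(left, MapType) and isinstance(right, MapType):
        return (equals_ignore_nullability(left.key_type, right.key_type) and
                equals_ignore_nullability(left.value_type, right.value_type))
    if isinstance(left, StructType) and isinstance(right, StructType):
        return (len(left.fields) == len(right.fields) and
                all(lf.name == rf.name and
                    equals_ignore_nullability(lf.data_type, rf.data_type)
                    for lf, rf in zip(left.fields, right.fields)))
    # Atomic types: plain equality.
    return left == right
```

Since the comparison only needs the public type tree, this style of helper can live outside `DataType` without touching non-public APIs, which is what the review thread is probing.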
[GitHub] [spark] cloud-fan closed pull request #40397: [SPARK-42052][SQL] Codegen Support for HiveSimpleUDF
cloud-fan closed pull request #40397: [SPARK-42052][SQL] Codegen Support for HiveSimpleUDF URL: https://github.com/apache/spark/pull/40397
[GitHub] [spark] cloud-fan commented on pull request #40397: [SPARK-42052][SQL] Codegen Support for HiveSimpleUDF
cloud-fan commented on PR #40397: URL: https://github.com/apache/spark/pull/40397#issuecomment-1478896479 thanks, merging to master!
[GitHub] [spark] beliefer commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`
beliefer commented on code in PR #40467: URL: https://github.com/apache/spark/pull/40467#discussion_r1144208941 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala: ## @@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: proto.Expression) extends Logg * @group df_ops * @since 3.4.0 */ - def explain(extended: Boolean): Unit = { + def explain(extended: Boolean)(implicit spark: SparkSession): Unit = { Review Comment: I roughly understand what you mean. We don't need to follow SQL, but the output should reflect the hierarchy of the tree.
[GitHub] [spark] zhengruifeng commented on pull request #40514: [SPARK-41233][CONNECT][PYTHON] Add array_prepend to Spark Connect Python client
zhengruifeng commented on PR #40514: URL: https://github.com/apache/spark/pull/40514#issuecomment-1478884594 would you mind adding a simple test here? https://github.com/apache/spark/blob/149e020a5ca88b2db9c56a9d48e0c1c896b57069/python/pyspark/sql/tests/connect/test_connect_function.py#L1077-L1089
[GitHub] [spark] hvanhovell commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`
hvanhovell commented on code in PR #40467: URL: https://github.com/apache/spark/pull/40467#discussion_r1144196759 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala: ## @@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: proto.Expression) extends Logg * @group df_ops * @since 3.4.0 */ - def explain(extended: Boolean): Unit = { + def explain(extended: Boolean)(implicit spark: SparkSession): Unit = { Review Comment: @beliefer no we can't be fully consistent with SQL. We are already deviating from what the original implementation does when you set `extended = true`. I just want something that is more readable. As for the implementation: the catalyst dependency is going away. Just create a bit of tree traversal logic that nicely renders the expression tree.
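The "bit of tree traversal logic" suggested here could look like the following sketch — illustrative Python only, with a made-up `Expr` node; the real implementation would walk `proto.Expression` nodes in Scala:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Expr:
    # A stand-in for a Connect expression proto node.
    name: str
    children: List["Expr"] = field(default_factory=list)

def render(expr: Expr, indent: int = 0) -> str:
    # Depth-first traversal; each level of the tree indents two spaces,
    # so the printed output reflects the hierarchy of the expression.
    lines = ["  " * indent + expr.name]
    for child in expr.children:
        lines.append(render(child, indent + 1))
    return "\n".join(lines)

tree = Expr("add", [Expr("col(a)"),
                    Expr("multiply", [Expr("col(b)"), Expr("lit(2)")])])
print(render(tree))
```

Rendering client-side like this needs no `SparkSession` at all, which would also sidestep the `implicit spark` parameter being debated in the diff.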
[GitHub] [spark] hvanhovell closed pull request #40413: [SPARK-42786][Connect] Typed Select
hvanhovell closed pull request #40413: [SPARK-42786][Connect] Typed Select URL: https://github.com/apache/spark/pull/40413
[GitHub] [spark] hvanhovell commented on pull request #40413: [SPARK-42786][Connect] Typed Select
hvanhovell commented on PR #40413: URL: https://github.com/apache/spark/pull/40413#issuecomment-1478874829 Merging to master!
[GitHub] [spark] hvanhovell commented on pull request #40368: [SPARK-42748][CONNECT] Server-side Artifact Management
hvanhovell commented on PR #40368: URL: https://github.com/apache/spark/pull/40368#issuecomment-1478874581 @vicennial can you update?
[GitHub] [spark] panbingkun commented on pull request #40397: [SPARK-42052][SQL] Codegen Support for HiveSimpleUDF
panbingkun commented on PR #40397: URL: https://github.com/apache/spark/pull/40397#issuecomment-1478870787 @cloud-fan Can we merge it to master? After that, I will try to refactor HiveGenericUDTF & HiveUDAFFunction. Thanks!
[GitHub] [spark] hvanhovell opened a new pull request, #40515: [SPARK-42884][CONNECT] Add Ammonite REPL integration
hvanhovell opened a new pull request, #40515: URL: https://github.com/apache/spark/pull/40515

### What changes were proposed in this pull request?
This PR adds Ammonite REPL integration for Spark Connect. This has a couple of benefits:
- It makes it a lot less cumbersome for users to start a Spark Connect REPL. You don't have to add custom scripts, and you can use `coursier` to launch a fully functional REPL for you.
- It adds REPL integration to the actual build. This makes it easier to validate that the code we add is actually working.

### Why are the changes needed?
A REPL is arguably the first entry point for a lot of users.

### Does this PR introduce _any_ user-facing change?
Yes, it adds REPL integration.

### How was this patch tested?
Added tests for the command line parsing. Manually tested the REPL.
[GitHub] [spark] LuciferYang commented on pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
LuciferYang commented on PR #40496: URL: https://github.com/apache/spark/pull/40496#issuecomment-1478858139

> @dtenedor Wait a few minutes for me to check with Scala 2.13 manually

done, should be ok ~
[GitHub] [spark] beliefer commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`
beliefer commented on code in PR #40467: URL: https://github.com/apache/spark/pull/40467#discussion_r1144182236 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala: ## @@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: proto.Expression) extends Logg * @group df_ops * @since 3.4.0 */ - def explain(extended: Boolean): Unit = { + def explain(extended: Boolean)(implicit spark: SparkSession): Unit = { Review Comment: @hvanhovell Do you mean to construct a local catalyst expression tree on the connect client side?
[GitHub] [spark] Stove-hust commented on pull request #40412: [SPARK-42784] should still create subDir when the number of subDir in merge dir is less than conf
Stove-hust commented on PR #40412: URL: https://github.com/apache/spark/pull/40412#issuecomment-1478854808 @zhouyejoe I kept the logs from the incident; they should help you. (In our cluster's machine environment, with 11 HDDs, creating 11 * 64 subdirectories takes a long time.)

23/02/21 10:44:09 YarnAllocator: Launching container container_184 on host hostA
23/02/21 10:44:21 YarnAllocator: Launching container container_185 on host hostA
23/02/21 10:44:21 YarnAllocator: Container container_184 on host: hostA was preempted.
[GitHub] [spark] Stove-hust commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
Stove-hust commented on PR #40393: URL: https://github.com/apache/spark/pull/40393#issuecomment-1478849048

> I could not cherry pick this into 3.4 and 3.3 - we should fix for those branches as well IMO. Can you create a PR against those two branches as well @Stove-hust ? Thanks

No problem
[GitHub] [spark] beliefer commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`
beliefer commented on code in PR #40467: URL: https://github.com/apache/spark/pull/40467#discussion_r1144175741 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala: ## @@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: proto.Expression) extends Logg * @group df_ops * @since 3.4.0 */ - def explain(extended: Boolean): Unit = { + def explain(extended: Boolean)(implicit spark: SparkSession): Unit = { Review Comment: @hvanhovell If so, the output of `Column.explain` can't keep consistent with Spark SQL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin opened a new pull request, #40514: [SPARK-41233][CONNECT][PYTHON] Add array_prepend to Spark Connect Python client
ueshin opened a new pull request, #40514: URL: https://github.com/apache/spark/pull/40514

### What changes were proposed in this pull request?
This is a follow-up of #38947. Add `array_prepend` function to Spark Connect Python client.

### Why are the changes needed?
`array_prepend` was added at #38947 without Spark Connect Python client.

### Does this PR introduce _any_ user-facing change?
`array_prepend` will be available in Spark Connect Python client.

### How was this patch tested?
Enabled the related test.
[GitHub] [spark] LuciferYang commented on pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
LuciferYang commented on PR #40496: URL: https://github.com/apache/spark/pull/40496#issuecomment-147884 @dtenedor Wait a few minutes for me to check with Scala 2.13 manually
[GitHub] [spark] dtenedor commented on pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
dtenedor commented on PR #40496: URL: https://github.com/apache/spark/pull/40496#issuecomment-1478835534 @HyukjinKwon the tests are passing now, this is ready to merge if you are ready :)
[GitHub] [spark] mridulm commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
mridulm commented on PR #40393: URL: https://github.com/apache/spark/pull/40393#issuecomment-1478833979 I could not cherry pick this into 3.4 and 3.3 - we should fix for those branches as well IMO. Can you create a PR against those two branches as well @Stove-hust ? Thanks
[GitHub] [spark] mridulm commented on pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
mridulm commented on PR #40393: URL: https://github.com/apache/spark/pull/40393#issuecomment-1478833621 Merged to master. Thanks for working on this @Stove-hust ! Thanks for the review @otterc :-)
[GitHub] [spark] mridulm closed pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks
mridulm closed pull request #40393: [SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks URL: https://github.com/apache/spark/pull/40393
[GitHub] [spark] xinrong-meng opened a new pull request, #40513: Block Arrow-optimized Python UDFs
xinrong-meng opened a new pull request, #40513: URL: https://github.com/apache/spark/pull/40513

### What changes were proposed in this pull request?
Block the usage of Arrow-optimized Python UDFs in Apache Spark 3.4.0.

### Why are the changes needed?
Considering the upcoming improvements to the result inconsistencies between traditional pickled Python UDFs and Arrow-optimized Python UDFs, we had better block the feature for now; otherwise, users who try out the feature should expect behavior changes in the next release.

### Does this PR introduce _any_ user-facing change?
Yes. Arrow-optimized Python UDFs are blocked.

### How was this patch tested?
Existing tests.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40467: [SPARK-42584][CONNECT] Improve output of `Column.explain`
hvanhovell commented on code in PR #40467: URL: https://github.com/apache/spark/pull/40467#discussion_r1144160497
## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Column.scala:
```
@@ -1211,13 +1211,11 @@ class Column private[sql] (private[sql] val expr: proto.Expression) extends Logg
 * @group df_ops
 * @since 3.4.0
 */
-  def explain(extended: Boolean): Unit = {
+  def explain(extended: Boolean)(implicit spark: SparkSession): Unit = {
```
Review Comment: I am not sure what happened here, my comment seems to have gone missing... Anyway, please don't send an RPC for this. Just use the local expression tree and create something that is a little bit easier to read.
[GitHub] [spark] LuciferYang commented on pull request #40489: [SPARK-42871][BUILD] Upgrade slf4j to 2.0.7
LuciferYang commented on PR #40489: URL: https://github.com/apache/spark/pull/40489#issuecomment-1478824499 Thanks @srowen @HyukjinKwon
[GitHub] [spark] LuciferYang commented on pull request #40490: [SPARK-42536][BUILD] Upgrade log4j2 to 2.20.0
LuciferYang commented on PR #40490: URL: https://github.com/apache/spark/pull/40490#issuecomment-1478821629 Thanks @dongjoon-hyun @viirya ~
[GitHub] [spark] frankliee commented on a diff in pull request #40504: [SPARK-42880][DOCS] Update running-on-yarn.md to log4j2 syntax
frankliee commented on code in PR #40504: URL: https://github.com/apache/spark/pull/40504#discussion_r1144154895
## docs/running-on-yarn.md:
```
@@ -137,7 +137,7 @@ Note that for the first option, both executors and the application master will s
 log4j configuration, which may cause issues when they run on the same node (e.g. trying to write to the same log file).

-If you need a reference to the proper location to put log files in the YARN so that YARN can properly display and aggregate them, use `spark.yarn.app.container.log.dir` in your `log4j.properties`. For example, `log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log`. For streaming applications, configuring `RollingFileAppender` and setting file location to YARN's log directory will avoid disk overflow caused by large log files, and logs can be accessed using YARN's log utility.
+If you need a reference to the proper location to put log files in the YARN so that YARN can properly display and aggregate them, use `spark.yarn.app.container.log.dir` in your `log4j.properties`. For example, `appender.spark.fileName=${sys:spark.yarn.app.container.log.dir}/spark.log`. For streaming applications, configuring `RollingFileAppender` and setting file location to YARN's log directory will avoid disk overflow caused by large log files, and logs can be accessed using YARN's log utility.
```
Review Comment: I have changed back to `file_appender`.
[GitHub] [spark] frankliee commented on a diff in pull request #40504: [SPARK-42880][DOCS] Update running-on-yarn.md to log4j2 syntax
frankliee commented on code in PR #40504: URL: https://github.com/apache/spark/pull/40504#discussion_r1144154348
## docs/running-on-yarn.md:
```
@@ -137,7 +137,7 @@ Note that for the first option, both executors and the application master will s
 log4j configuration, which may cause issues when they run on the same node (e.g. trying to write to the same log file).

-If you need a reference to the proper location to put log files in the YARN so that YARN can properly display and aggregate them, use `spark.yarn.app.container.log.dir` in your `log4j.properties`. For example, `log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log`. For streaming applications, configuring `RollingFileAppender` and setting file location to YARN's log directory will avoid disk overflow caused by large log files, and logs can be accessed using YARN's log utility.
+If you need a reference to the proper location to put log files in the YARN so that YARN can properly display and aggregate them, use `spark.yarn.app.container.log.dir` in your `log4j.properties`. For example, `appender.spark.fileName=${sys:spark.yarn.app.container.log.dir}/spark.log`. For streaming applications, configuring `RollingFileAppender` and setting file location to YARN's log directory will avoid disk overflow caused by large log files, and logs can be accessed using YARN's log utility.
```
Review Comment: updated
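For context on the doc change being reviewed, a minimal log4j2 `log4j.properties` sketch of the pattern the updated sentence describes. The appender name `file_appender`, sizes, and layout pattern here are illustrative assumptions, not text from the Spark docs:

```properties
# Hypothetical log4j2 properties sketch: a rolling file appender writing into
# YARN's container log directory so YARN can display and aggregate the logs.
appender.file_appender.type = RollingFile
appender.file_appender.name = file_appender
appender.file_appender.fileName = ${sys:spark.yarn.app.container.log.dir}/spark.log
appender.file_appender.filePattern = ${sys:spark.yarn.app.container.log.dir}/spark-%i.log
appender.file_appender.layout.type = PatternLayout
appender.file_appender.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
appender.file_appender.policies.type = Policies
appender.file_appender.policies.size.type = SizeBasedTriggeringPolicy
appender.file_appender.policies.size.size = 100MB
appender.file_appender.strategy.type = DefaultRolloverStrategy
appender.file_appender.strategy.max = 5

rootLogger.level = info
rootLogger.appenderRef.file.ref = file_appender
```

Note the `${sys:...}` prefix: in log4j2 syntax the container log directory is resolved as a Java system property, which is the substance of the doc fix under review.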
[GitHub] [spark] amaliujia commented on pull request #40512: [SPARK-42892][SQL] Move sameType and relevant methods out of DataType
amaliujia commented on PR #40512: URL: https://github.com/apache/spark/pull/40512#issuecomment-1478801856 @hvanhovell @cloud-fan
[GitHub] [spark] amaliujia opened a new pull request, #40512: [SPARK-42892][SQL] Move sameType and relevant methods out of DataType
amaliujia opened a new pull request, #40512: URL: https://github.com/apache/spark/pull/40512

### What changes were proposed in this pull request?
This PR moves the following methods out of `DataType`:
1. equalsIgnoreNullability
2. sameType
3. equalsIgnoreCaseAndNullability
The moved methods are grouped together in a utility class.

### Why are the changes needed?
To make `DataType` a simpler interface; non-public methods can be moved out of the DataType class.

### Does this PR introduce _any_ user-facing change?
No, as the moved methods are private within Spark.

### How was this patch tested?
Existing UT.
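As a rough illustration of the refactoring described above (a hypothetical Python analogue; the actual change is to Scala's `DataType`), the comparison helpers become free functions in a utility module instead of methods on the type class, so the type stays a plain data holder:

```python
# Hypothetical analogue of SPARK-42892: the type class holds only data,
# while non-public comparisons live in a separate "utils" layer.

class StructField:
    """A simplified field: a name, a type string, and a nullability flag."""
    def __init__(self, name, dtype, nullable):
        self.name, self.dtype, self.nullable = name, dtype, nullable


# Utility functions (analogous to equalsIgnoreNullability / sameType /
# equalsIgnoreCaseAndNullability being moved off DataType).
def equals_ignore_nullability(a, b):
    """Same name and type; the nullable flag is not compared."""
    return a.dtype == b.dtype and a.name == b.name


def equals_ignore_case_and_nullability(a, b):
    """Same type; names compared case-insensitively; nullability ignored."""
    return a.dtype == b.dtype and a.name.lower() == b.name.lower()


x = StructField("Id", "int", nullable=True)
y = StructField("id", "int", nullable=False)
print(equals_ignore_nullability(x, y))            # False: names differ by case
print(equals_ignore_case_and_nullability(x, y))   # True
```

The design point the PR description makes is that callers inside Spark can use the utility while the public `DataType` surface stays minimal.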
[GitHub] [spark] cloud-fan closed pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
cloud-fan closed pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes URL: https://github.com/apache/spark/pull/40385
[GitHub] [spark] cloud-fan commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
cloud-fan commented on PR #40385: URL: https://github.com/apache/spark/pull/40385#issuecomment-1478798727 thanks, merging to master/3.4!
[GitHub] [spark] HyukjinKwon commented on pull request #40034: [SPARK-42447][INFRA] Remove Hadoop 2 GitHub Action job
HyukjinKwon commented on PR #40034: URL: https://github.com/apache/spark/pull/40034#issuecomment-1478796828 I remember the Guava upgrade is also blocked by Hive .. IIRC ..
[GitHub] [spark] HyukjinKwon closed pull request #40494: [MINOR][DOCS] Fix typos
HyukjinKwon closed pull request #40494: [MINOR][DOCS] Fix typos URL: https://github.com/apache/spark/pull/40494
[GitHub] [spark] HyukjinKwon commented on pull request #40494: [MINOR][DOCS] Fix typos
HyukjinKwon commented on PR #40494: URL: https://github.com/apache/spark/pull/40494#issuecomment-1478792526 Merged to master.
[GitHub] [spark] HyukjinKwon closed pull request #40510: [SPARK-42889][CONNECT][PYTHON] Implement cache, persist, unpersist, and storageLevel
HyukjinKwon closed pull request #40510: [SPARK-42889][CONNECT][PYTHON] Implement cache, persist, unpersist, and storageLevel URL: https://github.com/apache/spark/pull/40510
[GitHub] [spark] HyukjinKwon commented on pull request #40510: [SPARK-42889][CONNECT][PYTHON] Implement cache, persist, unpersist, and storageLevel
HyukjinKwon commented on PR #40510: URL: https://github.com/apache/spark/pull/40510#issuecomment-1478790850 Merged to master and branch-3.4.
[GitHub] [spark] dongjoon-hyun commented on pull request #40447: [SPARK-42816][CONNECT] Support Max Message size up to 128MB
dongjoon-hyun commented on PR #40447: URL: https://github.com/apache/spark/pull/40447#issuecomment-1478787974 Merged to master/branch-3.4. Thank you, @grundprinzip and all!
[GitHub] [spark] dongjoon-hyun closed pull request #40447: [SPARK-42816][CONNECT] Support Max Message size up to 128MB
dongjoon-hyun closed pull request #40447: [SPARK-42816][CONNECT] Support Max Message size up to 128MB URL: https://github.com/apache/spark/pull/40447
[GitHub] [spark] sunchao commented on pull request #40511: [SPARK-42888][BUILD] Upgrade `gcs-connector` to 2.2.11
sunchao commented on PR #40511: URL: https://github.com/apache/spark/pull/40511#issuecomment-1478787651 LGTM too, thanks @cnauroth @dongjoon-hyun !
[GitHub] [spark] sudoliyang commented on pull request #40494: [MINOR][DOCS] Fix typos
sudoliyang commented on PR #40494: URL: https://github.com/apache/spark/pull/40494#issuecomment-1478786416 @HyukjinKwon Thanks. I ran the tests at https://github.com/sudoliyang/spark/pull/2. I would like to rebase and force push again to run all tests here.
[GitHub] [spark] dongjoon-hyun commented on pull request #40511: [SPARK-42888][BUILD] Upgrade `gcs-connector` to 2.2.11
dongjoon-hyun commented on PR #40511: URL: https://github.com/apache/spark/pull/40511#issuecomment-1478780794 Merged to master/3.4 for Apache Spark 3.4.0. This will be a part of the next Apache Spark 3.4.0 RC. I added you to the Apache Spark contributor group and assigned SPARK-42888 to you. Welcome to the Apache Spark community, @cnauroth . Also, cc @sunchao
[GitHub] [spark] dongjoon-hyun closed pull request #40511: [SPARK-42888][BUILD] Upgrade `gcs-connector` to 2.2.11
dongjoon-hyun closed pull request #40511: [SPARK-42888][BUILD] Upgrade `gcs-connector` to 2.2.11 URL: https://github.com/apache/spark/pull/40511
[GitHub] [spark] zhengruifeng commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations
zhengruifeng commented on PR #40263: URL: https://github.com/apache/spark/pull/40263#issuecomment-1478764552 @srowen sounds reasonable
[GitHub] [spark] HyukjinKwon closed pull request #40489: [SPARK-42871][BUILD] Upgrade slf4j to 2.0.7
HyukjinKwon closed pull request #40489: [SPARK-42871][BUILD] Upgrade slf4j to 2.0.7 URL: https://github.com/apache/spark/pull/40489
[GitHub] [spark] HyukjinKwon commented on pull request #40489: [SPARK-42871][BUILD] Upgrade slf4j to 2.0.7
HyukjinKwon commented on PR #40489: URL: https://github.com/apache/spark/pull/40489#issuecomment-1478757952 Merged to master.
[GitHub] [spark] HyukjinKwon commented on pull request #40468: [SPARK-42838][SQL] changed error class name _LEGACY_ERROR_TEMP_2000
HyukjinKwon commented on PR #40468: URL: https://github.com/apache/spark/pull/40468#issuecomment-1478757584 Mind filling in the PR description as well?
[GitHub] [spark] HyukjinKwon closed pull request #40507: [SPARK-42662][CONNECT][PS] Add proto message for pandas API on Spark default index
HyukjinKwon closed pull request #40507: [SPARK-42662][CONNECT][PS] Add proto message for pandas API on Spark default index URL: https://github.com/apache/spark/pull/40507
[GitHub] [spark] HyukjinKwon closed pull request #40505: [MINOR][DOCS] Remove SparkSession constructor invocation in the example
HyukjinKwon closed pull request #40505: [MINOR][DOCS] Remove SparkSession constructor invocation in the example URL: https://github.com/apache/spark/pull/40505
[GitHub] [spark] HyukjinKwon commented on pull request #40507: [SPARK-42662][CONNECT][PS] Add proto message for pandas API on Spark default index
HyukjinKwon commented on PR #40507: URL: https://github.com/apache/spark/pull/40507#issuecomment-1478755845 Merged to master.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40505: [MINOR][DOCS] Remove SparkSession constructor invocation in the example
HyukjinKwon commented on code in PR #40505: URL: https://github.com/apache/spark/pull/40505#discussion_r1144108734
## python/pyspark/sql/session.py:
```
@@ -179,10 +179,15 @@ class SparkSession(SparkConversionMixin):
 ...     .getOrCreate()
 ... )

-Create a Spark session from a Spark context.
->>> sc = spark.sparkContext
->>> spark = SparkSession(sc)
+Create a Spark session with Spark Connect.
+>>> spark = (
+...     SparkSession.builder
+...     .remote("sc://localhost")
```
Review Comment: That's fine without a port - it uses the default port 15002.
[GitHub] [spark] HyukjinKwon commented on pull request #40505: [MINOR][DOCS] Remove SparkSession constructor invocation in the example
HyukjinKwon commented on PR #40505: URL: https://github.com/apache/spark/pull/40505#issuecomment-1478755239 Merged to master and branch-3.4.
[GitHub] [spark] github-actions[bot] closed pull request #38608: [SPARK-41080][SQL] Support Bit manipulation function SETBIT
github-actions[bot] closed pull request #38608: [SPARK-41080][SQL] Support Bit manipulation function SETBIT URL: https://github.com/apache/spark/pull/38608
[GitHub] [spark] github-actions[bot] closed pull request #38534: [SPARK-38505][SQL] Make partial aggregation adaptive
github-actions[bot] closed pull request #38534: [SPARK-38505][SQL] Make partial aggregation adaptive URL: https://github.com/apache/spark/pull/38534
[GitHub] [spark] zhouyejoe commented on pull request #40412: [SPARK-42784] should still create subDir when the number of subDir in merge dir is less than conf
zhouyejoe commented on PR #40412: URL: https://github.com/apache/spark/pull/40412#issuecomment-1478751685 @Stove-hust I think the changes will help resolve the issue described in the ticket. I am checking further into what could be causing the race conditions where two Executor containers start at almost the same time on the same node, both enter this code section, and try to create subdirs at the same time.
[GitHub] [spark] cnauroth commented on pull request #40511: [SPARK-42888][BUILD] Upgrade GCS connector from 2.2.7 to 2.2.11.
cnauroth commented on PR #40511: URL: https://github.com/apache/spark/pull/40511#issuecomment-1478711103 @dongjoon-hyun , may I ask for your review, since you did the original import of the GCS connector in [SPARK-33605](https://issues.apache.org/jira/browse/SPARK-33605)/#37745? Thank you!
[GitHub] [spark] cnauroth opened a new pull request, #40511: [SPARK-42888][BUILD] Upgrade GCS connector from 2.2.7 to 2.2.11.
cnauroth opened a new pull request, #40511: URL: https://github.com/apache/spark/pull/40511

### What changes were proposed in this pull request?
Upgrade the [GCS Connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/tree/v2.2.11/gcs) bundled in the Spark distro from version 2.2.7 to 2.2.11.

### Why are the changes needed?
The new release contains multiple bug fixes and enhancements discussed in the [Release Notes](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.2.11/gcs/CHANGES.md). Notable changes include:
* Improved socket timeout handling.
* Trace logging capabilities.
* Fix for a bug that prevented usage of GCS as a [Hadoop Credential Provider](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html).
* Dependency upgrades.
* Support for OAuth2-based client authentication.

### Does this PR introduce _any_ user-facing change?
Distributions built with `-Phadoop-cloud` now include GCS connector 2.2.11 instead of 2.2.7.
```
cnauroth@cnauroth-2-1-m:~/spark-3.5.0-SNAPSHOT-bin-custom-spark$ ls -lrt jars/gcs*
-rw-r--r-- 1 cnauroth cnauroth 36497606 Mar 21 00:42 jars/gcs-connector-hadoop3-2.2.11-shaded.jar
```

### How was this patch tested?
**Build**
I built a custom distro with `-Phadoop-cloud`:
```
./dev/make-distribution.sh --name custom-spark --pip --tgz -Phadoop-3 -Phadoop-cloud -Pscala-2.12
```
**Run**
I ran a PySpark job that successfully reads and writes using GCS:
```
from pyspark.sql import SparkSession


def main() -> None:
    # Create SparkSession.
    spark = (SparkSession.builder
             .appName('copy-shakespeare')
             .getOrCreate())

    # Read.
    df = spark.read.text('gs://dataproc-datasets-us-central1/shakespeare')

    # Write.
    df.write.text('gs://cnauroth-hive-metastore-proxy-dist/output/copy-shakespeare')

    spark.stop()


if __name__ == '__main__':
    main()
```
Authored-by: Chris Nauroth
[GitHub] [spark] ueshin commented on a diff in pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options
ueshin commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1144066905

## python/pyspark/sql/connect/plan.py:

```diff
@@ -302,13 +302,16 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
 class Read(LogicalPlan):
-    def __init__(self, table_name: str) -> None:
+    def __init__(self, table_name: str, options: Dict[str, str] = {}) -> None:
```

Review Comment: Please avoid `{}` as a default argument, which could cause unexpected failures.

```py
def __init__(self, table_name: str, options: Optional[Dict[str, str]] = None) -> None:
    ...
    self.options = options or {}
```
[GitHub] [spark] ueshin commented on a diff in pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options
ueshin commented on code in PR #40498: URL: https://github.com/apache/spark/pull/40498#discussion_r1144066373

## connector/connect/common/src/main/protobuf/spark/connect/relations.proto:

```diff
@@ -122,6 +122,9 @@ message Read {
   message NamedTable {
     // (Required) Unparsed identifier for the table.
     string unparsed_identifier = 1;
+
+    // Options for the named table. The map key is case insensitive.
+    map<string, string> options = 3;
```

Review Comment: nit: Why is this id `3`? seems like `2` is still available?

## python/pyspark/sql/connect/plan.py:

```diff
@@ -302,13 +302,16 @@ def plan(self, session: "SparkConnectClient") -> proto.Relation:
 class Read(LogicalPlan):
-    def __init__(self, table_name: str) -> None:
+    def __init__(self, table_name: str, options: Dict[str, str] = {}) -> None:
```

Review Comment: Please avoid `{}` as a default argument, which could cause unexpected behaviors.
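[Editorial note] The pitfall ueshin flags is a classic Python gotcha: a mutable default such as `{}` is evaluated once, at function definition time, so every call that omits the argument shares the same dict object. A minimal standalone sketch — the `Read`/`SafeRead` classes here are illustrative toys, not the actual Spark Connect plan classes:

```python
# The default {} below is created once, when the function is defined, so
# every call that omits `options` receives the SAME dict object.
class Read:
    def __init__(self, table_name, options={}):  # anti-pattern
        self.options = options
        self.options["source"] = table_name      # mutates the shared default!


r1 = Read("t1")
r2 = Read("t2")
print(r1.options)                # {'source': 't2'} -- r2's call leaked into r1
print(r1.options is r2.options)  # True: one shared dict


# The fix suggested in the review: default to None and fall back to a fresh dict.
class SafeRead:
    def __init__(self, table_name, options=None):
        self.options = dict(options or {})
        self.options["source"] = table_name


s1 = SafeRead("t1")
s2 = SafeRead("t2")
print(s1.options)  # {'source': 't1'} -- each instance owns its own dict
```

The `options or {}` idiom also treats an explicitly passed empty dict as "no options", which is usually the intended semantics here.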
[GitHub] [spark] ueshin opened a new pull request, #40510: [SPARK-42889][CONNECT][PYTHON] Implement cache, persist, unpersist, and storageLevel
ueshin opened a new pull request, #40510: URL: https://github.com/apache/spark/pull/40510

### What changes were proposed in this pull request?

Implements `DataFrame.cache`, `persist`, `unpersist`, and `storageLevel`.

### Why are the changes needed?

Missing APIs.

### Does this PR introduce _any_ user-facing change?

`DataFrame.cache`, `persist`, `unpersist`, and `storageLevel` will be available.

### How was this patch tested?

Added/enabled the related tests.
[GitHub] [spark] amaliujia commented on pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
amaliujia commented on PR #40496: URL: https://github.com/apache/spark/pull/40496#issuecomment-1478658342

LGTM
[GitHub] [spark] dtenedor commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
dtenedor commented on code in PR #40496: URL: https://github.com/apache/spark/pull/40496#discussion_r1144040124

## sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestHelper.scala:

```diff
@@ -59,10 +61,39 @@ trait SQLQueryTestHelper extends Logging {
    */
   protected def getNormalizedQueryAnalysisResult(
       session: SparkSession, sql: String): (String, Seq[String]) = {
+    // Note that creating the following DataFrame includes eager execution for commands that create
+    // objects such as views. Therefore any following queries that reference these objects should
+    // find them in the catalog.
     val df = session.sql(sql)
     val schema = df.schema.catalogString
-    // Get the answer, but also get rid of the #1234 expression IDs that show up in analyzer plans.
-    (schema, Seq(replaceNotIncludedMsg(df.queryExecution.analyzed.toString)))
+    val analyzed = df.queryExecution.analyzed
+    // Determine if the analyzed plan contains any nondeterministic expressions.
+    var deterministic = true
+    analyzed.transformAllExpressionsWithSubqueries {
+      case expr: CurrentDate =>
+        deterministic = false
+        expr
+      case expr: CurrentTimestampLike =>
+        deterministic = false
+        expr
+      case expr: CurrentUser =>
+        deterministic = false
+        expr
+      case expr: Literal if expr.dataType == DateType || expr.dataType == TimestampType =>
+        deterministic = false
+        expr
+      case expr if !expr.deterministic =>
+        deterministic = false
+        expr
+    }
+    if (deterministic) {
+      // Perform query analysis, but also get rid of the #1234 expression IDs that show up in the
+      // resolved plans.
+      (schema, Seq(replaceNotIncludedMsg(analyzed.toString)))
+    } else {
+      // The analyzed plan is nondeterministic so elide it from the result to keep tests reliable.
+      (schema, Seq("[Analyzer test output redacted due to nondeterminism]"))
+    }
```

Review Comment: Oh yeah it's only used in the SQL golden test runner. It's in `SQLQueryTestHelper.scala` :)
[GitHub] [spark] amaliujia commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
amaliujia commented on code in PR #40496: URL: https://github.com/apache/spark/pull/40496#discussion_r1144038811

## sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestHelper.scala:

```diff
@@ -59,10 +61,39 @@ trait SQLQueryTestHelper extends Logging {
    */
   protected def getNormalizedQueryAnalysisResult(
       session: SparkSession, sql: String): (String, Seq[String]) = {
+    // Note that creating the following DataFrame includes eager execution for commands that create
+    // objects such as views. Therefore any following queries that reference these objects should
+    // find them in the catalog.
     val df = session.sql(sql)
     val schema = df.schema.catalogString
-    // Get the answer, but also get rid of the #1234 expression IDs that show up in analyzer plans.
-    (schema, Seq(replaceNotIncludedMsg(df.queryExecution.analyzed.toString)))
+    val analyzed = df.queryExecution.analyzed
+    // Determine if the analyzed plan contains any nondeterministic expressions.
+    var deterministic = true
+    analyzed.transformAllExpressionsWithSubqueries {
+      case expr: CurrentDate =>
+        deterministic = false
+        expr
+      case expr: CurrentTimestampLike =>
+        deterministic = false
+        expr
+      case expr: CurrentUser =>
+        deterministic = false
+        expr
+      case expr: Literal if expr.dataType == DateType || expr.dataType == TimestampType =>
+        deterministic = false
+        expr
+      case expr if !expr.deterministic =>
+        deterministic = false
+        expr
+    }
+    if (deterministic) {
+      // Perform query analysis, but also get rid of the #1234 expression IDs that show up in the
+      // resolved plans.
+      (schema, Seq(replaceNotIncludedMsg(analyzed.toString)))
+    } else {
+      // The analyzed plan is nondeterministic so elide it from the result to keep tests reliable.
+      (schema, Seq("[Analyzer test output redacted due to nondeterminism]"))
+    }
```

Review Comment: I was wondering whether this code is used outside of testing; I tried a code search, and it seems to be used only by test suites.

Basically, I was not sure whether, if this part is also used in production, the `if`/`else` would change the output there. At least the code search shows this is not a concern.
[GitHub] [spark] dtenedor commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
dtenedor commented on code in PR #40496: URL: https://github.com/apache/spark/pull/40496#discussion_r1144023834

## sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestHelper.scala:

```diff
@@ -59,10 +61,39 @@ trait SQLQueryTestHelper extends Logging {
    */
   protected def getNormalizedQueryAnalysisResult(
       session: SparkSession, sql: String): (String, Seq[String]) = {
+    // Note that creating the following DataFrame includes eager execution for commands that create
+    // objects such as views. Therefore any following queries that reference these objects should
+    // find them in the catalog.
     val df = session.sql(sql)
     val schema = df.schema.catalogString
-    // Get the answer, but also get rid of the #1234 expression IDs that show up in analyzer plans.
-    (schema, Seq(replaceNotIncludedMsg(df.queryExecution.analyzed.toString)))
+    val analyzed = df.queryExecution.analyzed
+    // Determine if the analyzed plan contains any nondeterministic expressions.
+    var deterministic = true
+    analyzed.transformAllExpressionsWithSubqueries {
+      case expr: CurrentDate =>
+        deterministic = false
+        expr
+      case expr: CurrentTimestampLike =>
+        deterministic = false
+        expr
+      case expr: CurrentUser =>
+        deterministic = false
+        expr
+      case expr: Literal if expr.dataType == DateType || expr.dataType == TimestampType =>
+        deterministic = false
+        expr
+      case expr if !expr.deterministic =>
+        deterministic = false
+        expr
+    }
+    if (deterministic) {
+      // Perform query analysis, but also get rid of the #1234 expression IDs that show up in the
+      // resolved plans.
+      (schema, Seq(replaceNotIncludedMsg(analyzed.toString)))
+    } else {
+      // The analyzed plan is nondeterministic so elide it from the result to keep tests reliable.
+      (schema, Seq("[Analyzer test output redacted due to nondeterminism]"))
+    }
```

Review Comment: Hi @amaliujia, I am a bit confused by your question, would you mind clarifying?

This redaction is for the generated analyzer test result files only, in order to prevent the tests from becoming flaky from things like `current_date()` changing each day.
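[Editorial note] The redaction logic under review can be paraphrased in plain Python to show its shape: walk the expression tree, flag any node of a nondeterministic kind, and replace the rendered plan with a fixed marker. This is a hypothetical sketch — the tuple representation and kind names are illustrative stand-ins for Spark's expression classes, not real Spark APIs:

```python
# Expressions are modeled as (kind, children) tuples; the kind names below
# stand in for CurrentDate / CurrentTimestampLike / CurrentUser etc.
NONDETERMINISTIC_KINDS = {"CurrentDate", "CurrentTimestamp", "CurrentUser", "Rand"}

REDACTED = "[Analyzer test output redacted due to nondeterminism]"


def is_deterministic(expr):
    """Return False if any node in the expression tree is nondeterministic."""
    kind, children = expr
    if kind in NONDETERMINISTIC_KINDS:
        return False
    return all(is_deterministic(child) for child in children)


def normalized_result(expr, rendered_plan):
    # Mirrors the if/else in the diff: keep the rendered plan only when the
    # whole tree is deterministic, otherwise emit a fixed marker so golden
    # files never capture values like current_date().
    return rendered_plan if is_deterministic(expr) else REDACTED


plan = ("Project", [("Alias", [("CurrentDate", [])])])
print(normalized_result(plan, "Project [...]"))  # prints the redaction marker
```

The key design point is that the check is recursive over the whole plan, so a nondeterministic expression buried inside a subquery still triggers the redaction.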
[GitHub] [spark] dtenedor commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
dtenedor commented on code in PR #40496: URL: https://github.com/apache/spark/pull/40496#discussion_r1144021973

## sql/core/src/test/resources/sql-tests/analyzer-results/ansi/cast.sql.out:

```diff
@@ -0,0 +1,881 @@
+-- Automatically generated by SQLQueryTestSuite
+-- !query
+SELECT CAST('1.23' AS int)
+-- !query analysis
+Project [cast(1.23 as int) AS CAST(1.23 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('1.23' AS long)
+-- !query analysis
+Project [cast(1.23 as bigint) AS CAST(1.23 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-4.56' AS int)
+-- !query analysis
+Project [cast(-4.56 as int) AS CAST(-4.56 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-4.56' AS long)
+-- !query analysis
+Project [cast(-4.56 as bigint) AS CAST(-4.56 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS int)
+-- !query analysis
+Project [cast(abc as int) AS CAST(abc AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS long)
+-- !query analysis
+Project [cast(abc as bigint) AS CAST(abc AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS float)
+-- !query analysis
+Project [cast(abc as float) AS CAST(abc AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('abc' AS double)
+-- !query analysis
+Project [cast(abc as double) AS CAST(abc AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('1234567890123' AS int)
+-- !query analysis
+Project [cast(1234567890123 as int) AS CAST(1234567890123 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('12345678901234567890123' AS long)
+-- !query analysis
+Project [cast(12345678901234567890123 as bigint) AS CAST(12345678901234567890123 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS int)
+-- !query analysis
+Project [cast( as int) AS CAST( AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS long)
+-- !query analysis
+Project [cast( as bigint) AS CAST( AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS float)
+-- !query analysis
+Project [cast( as float) AS CAST( AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('' AS double)
+-- !query analysis
+Project [cast( as double) AS CAST( AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST(NULL AS int)
+-- !query analysis
+Project [cast(null as int) AS CAST(NULL AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST(NULL AS long)
+-- !query analysis
+Project [cast(null as bigint) AS CAST(NULL AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS int)
+-- !query analysis
+Project [cast(123.a as int) AS CAST(123.a AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS long)
+-- !query analysis
+Project [cast(123.a as bigint) AS CAST(123.a AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS float)
+-- !query analysis
+Project [cast(123.a as float) AS CAST(123.a AS FLOAT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('123.a' AS double)
+-- !query analysis
+Project [cast(123.a as double) AS CAST(123.a AS DOUBLE)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-2147483648' AS int)
+-- !query analysis
+Project [cast(-2147483648 as int) AS CAST(-2147483648 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-2147483649' AS int)
+-- !query analysis
+Project [cast(-2147483649 as int) AS CAST(-2147483649 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('2147483647' AS int)
+-- !query analysis
+Project [cast(2147483647 as int) AS CAST(2147483647 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('2147483648' AS int)
+-- !query analysis
+Project [cast(2147483648 as int) AS CAST(2147483648 AS INT)#x]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-9223372036854775808' AS long)
+-- !query analysis
+Project [cast(-9223372036854775808 as bigint) AS CAST(-9223372036854775808 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('-9223372036854775809' AS long)
+-- !query analysis
+Project [cast(-9223372036854775809 as bigint) AS CAST(-9223372036854775809 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('9223372036854775807' AS long)
+-- !query analysis
+Project [cast(9223372036854775807 as bigint) AS CAST(9223372036854775807 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT CAST('9223372036854775808' AS long)
+-- !query analysis
+Project [cast(9223372036854775808 as bigint) AS CAST(9223372036854775808 AS BIGINT)#xL]
++- OneRowRelation
+
+
+-- !query
+SELECT HEX(CAST('abc' AS binary))
+-- !query analysis
+Project [hex(cast(abc as binary)) AS hex(CAST(abc AS BINARY))#x]
++- OneRowRelation
+
+
+-- !query
+SELECT HEX(CAST(CAST(123 AS byte) AS binary))
+-- !query analysis
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "DATATYPE_MISMATCH.CAST_WITH_CONF_SUGGESTION",
+  "sqlState" : "42K09",
+  "messageParameters" : {
+    "config" : "\"spark.sql.ansi.enabled\"",
+    "configVal" : "'false'",
+    "sqlExpr" : "\"CAST(CAST(123 AS TINYINT) AS
```
[GitHub] [spark] amaliujia commented on a diff in pull request #40496: [SPARK-42874][SQL] Enable new golden file test framework for analysis for all input files
amaliujia commented on code in PR #40496: URL: https://github.com/apache/spark/pull/40496#discussion_r1144020709

## sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestHelper.scala:

```diff
@@ -59,10 +61,39 @@ trait SQLQueryTestHelper extends Logging {
    */
   protected def getNormalizedQueryAnalysisResult(
       session: SparkSession, sql: String): (String, Seq[String]) = {
+    // Note that creating the following DataFrame includes eager execution for commands that create
+    // objects such as views. Therefore any following queries that reference these objects should
+    // find them in the catalog.
     val df = session.sql(sql)
     val schema = df.schema.catalogString
-    // Get the answer, but also get rid of the #1234 expression IDs that show up in analyzer plans.
-    (schema, Seq(replaceNotIncludedMsg(df.queryExecution.analyzed.toString)))
+    val analyzed = df.queryExecution.analyzed
+    // Determine if the analyzed plan contains any nondeterministic expressions.
+    var deterministic = true
+    analyzed.transformAllExpressionsWithSubqueries {
+      case expr: CurrentDate =>
+        deterministic = false
+        expr
+      case expr: CurrentTimestampLike =>
+        deterministic = false
+        expr
+      case expr: CurrentUser =>
+        deterministic = false
+        expr
+      case expr: Literal if expr.dataType == DateType || expr.dataType == TimestampType =>
+        deterministic = false
+        expr
+      case expr if !expr.deterministic =>
+        deterministic = false
+        expr
+    }
+    if (deterministic) {
+      // Perform query analysis, but also get rid of the #1234 expression IDs that show up in the
+      // resolved plans.
+      (schema, Seq(replaceNotIncludedMsg(analyzed.toString)))
+    } else {
+      // The analyzed plan is nondeterministic so elide it from the result to keep tests reliable.
+      (schema, Seq("[Analyzer test output redacted due to nondeterminism]"))
+    }
```

Review Comment: Should this be configurable? Otherwise every caller of this function will start to use the nondeterministic code path.
[GitHub] [spark] grundprinzip commented on pull request #39294: [SPARK-41537][INFRA][TESTS] Github Workflow Check for Breaking Changes in Spark Connect Proto
grundprinzip commented on PR #39294: URL: https://github.com/apache/spark/pull/39294#issuecomment-1478626808

The current level of proto compatibility required is the strictest one, which requires full compatibility, even in the naming of the fields. Strictly speaking, this is not required for wire compatibility. But before we relax the check, I would like to enforce the strictest level.
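[Editorial note] The distinction grundprinzip draws — name changes break the strict check but not the wire format — follows from the fact that the protobuf wire encoding carries only field numbers and values, never field names. A hand-rolled sketch of varint field encoding makes this concrete; this is illustrative code, not the real protobuf library:

```python
def encode_varint(n):
    # Protobuf base-128 varint: 7 payload bits per byte, MSB = continuation.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)


def encode_field(field_number, value):
    # Wire type 0 (varint); the key byte is (field_number << 3) | wire_type.
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


# Two schemas that differ only in the field's *name* produce identical bytes,
# because both map their field to number 1.
old_schema = encode_field(1, 42)   # e.g. message M { int32 count = 1; }
new_schema = encode_field(1, 42)   # e.g. message M { int32 num_rows = 1; }
print(old_schema == new_schema)    # True: renaming is wire-compatible

# Changing the field *number*, however, does break the wire format:
print(encode_field(2, 42) == old_schema)  # False
```

This is why the strict `buf breaking` rules (which also flag renames) are a policy choice on top of what the wire format itself demands — renames still break generated-code APIs and JSON serialization, so enforcing the strict level first is a conservative default.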
[GitHub] [spark] amaliujia commented on pull request #39294: [SPARK-41537][INFRA][TESTS] Github Workflow Check for Breaking Changes in Spark Connect Proto
amaliujia commented on PR #39294: URL: https://github.com/apache/spark/pull/39294#issuecomment-1478618853

> ![SCR-20230321-v7f](https://user-images.githubusercontent.com/3421/226745165-60b1611a-2ec5-4ddf-98c0-61c3a180ed9c.png)
>
> Verifying that the broken build is reported.

cc @zhenlineo as I recall you thought about field renaming? Seems like that is not allowed per the current design.
[GitHub] [spark] grundprinzip commented on a diff in pull request #39294: [SPARK-41537][INFRA][TESTS] Github Workflow Check for Breaking Changes in Spark Connect Proto
grundprinzip commented on code in PR #39294: URL: https://github.com/apache/spark/pull/39294#discussion_r1144010201

## .github/workflows/build_and_test.yml:

```diff
@@ -493,6 +494,80 @@ jobs:
         name: test-results-sparkr--8-${{ inputs.hadoop }}-hive2.3
         path: "**/target/test-reports/*.xml"
 
+  proto-breaking-changes-check-master:
+    needs: [precondition]
+    if: always() && fromJson(needs.precondition.outputs.required).proto-breaking-changes-check == 'true'
+    name: Spark Connect proto backwards compatibility check (master)
+    continue-on-error: true
+    runs-on: ubuntu-22.04
+    env:
+      LC_ALL: C.UTF-8
+      LANG: C.UTF-8
+      PYSPARK_DRIVER_PYTHON: python3.9
+      PYSPARK_PYTHON: python3.9
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v3
+      with:
+        fetch-depth: 0
+        repository: apache/spark
+        ref: ${{ inputs.branch }}
+    - name: Add GITHUB_WORKSPACE to git trust safe.directory
+      run: |
+        git config --global --add safe.directory ${GITHUB_WORKSPACE}
+    - name: Sync the current branch with the latest in Apache Spark
+      if: github.repository != 'apache/spark'
+      run: |
+        git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
+        git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' merge --no-commit --progress --squash FETCH_HEAD
+        git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' commit -m "Merged commit" --allow-empty
+    # Install the `buf` CLI
+    - uses: bufbuild/buf-setup-action@v1
+      with:
+        github_token: ${{ secrets.GITHUB_TOKEN }}
+    # Run buf breaking.
+    - uses: bufbuild/buf-breaking-action@v1
+      with:
+        input: connector/connect/common/src/main
+        against: 'https://github.com/apache/spark.git#branch=master,subdir=connector/connect/common/src/main'
+
+  proto-breaking-changes-check-release:
+    needs: [precondition]
+    if: always() && fromJson(needs.precondition.outputs.required).proto-breaking-changes-check == 'true'
+    name: Spark Connect proto backwards compatibility check (master)
+    runs-on: ubuntu-22.04
+    env:
+      LC_ALL: C.UTF-8
+      LANG: C.UTF-8
+      PYSPARK_DRIVER_PYTHON: python3.9
+      PYSPARK_PYTHON: python3.9
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v3
+      with:
+        fetch-depth: 0
+        repository: apache/spark
+        ref: ${{ inputs.branch }}
+    - name: Add GITHUB_WORKSPACE to git trust safe.directory
+      run: |
+        git config --global --add safe.directory ${GITHUB_WORKSPACE}
+    - name: Sync the current branch with the latest in Apache Spark
+      if: github.repository != 'apache/spark'
+      run: |
+        git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
+        git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' merge --no-commit --progress --squash FETCH_HEAD
+        git -c user.name='Apache Spark Test Account' -c user.email='sparktest...@gmail.com' commit -m "Merged commit" --allow-empty
+    # Install the `buf` CLI
+    - uses: bufbuild/buf-setup-action@v1
+      with:
+        github_token: ${{ secrets.GITHUB_TOKEN }}
+    # Run buf breaking.
+    - uses: bufbuild/buf-breaking-action@v1
+      with:
+        input: connector/connect/common/src/main
+        against: 'https://github.com/apache/spark.git#branch=branch-3.4,subdir=connector/connect/common/src/main'
```

Review Comment: runs against the 3.4 release branch; this has to be updated on every release to point at the previous branch.
[GitHub] [spark] grundprinzip commented on a diff in pull request #39294: [SPARK-41537][INFRA][TESTS] Github Workflow Check for Breaking Changes in Spark Connect Proto
grundprinzip commented on code in PR #39294: URL: https://github.com/apache/spark/pull/39294#discussion_r1144009897

## .github/workflows/build_and_test.yml:

```diff
@@ -493,6 +494,80 @@ jobs:
         name: test-results-sparkr--8-${{ inputs.hadoop }}-hive2.3
         path: "**/target/test-reports/*.xml"
 
+  proto-breaking-changes-check-master:
+    needs: [precondition]
+    if: always() && fromJson(needs.precondition.outputs.required).proto-breaking-changes-check == 'true'
+    name: Spark Connect proto backwards compatibility check (master)
+    continue-on-error: true
```

Review Comment: `continue-on-error` will let the workflow continue even if the job fails.
[GitHub] [spark] grundprinzip commented on pull request #39294: [SPARK-41537][INFRA][TESTS] Github Workflow Check for Breaking Changes in Spark Connect Proto
grundprinzip commented on PR #39294: URL: https://github.com/apache/spark/pull/39294#issuecomment-1478606975

![SCR-20230321-v7f](https://user-images.githubusercontent.com/3421/226745165-60b1611a-2ec5-4ddf-98c0-61c3a180ed9c.png)

Verifying that the broken build is reported.
[GitHub] [spark] StevenChenDatabricks commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
StevenChenDatabricks commented on PR #40385: URL: https://github.com/apache/spark/pull/40385#issuecomment-1478591539

@cloud-fan I've addressed your comments. Thanks for the review!
[GitHub] [spark] viirya commented on pull request #40490: [SPARK-42536][BUILD] Upgrade log4j2 to 2.20.0
viirya commented on PR #40490: URL: https://github.com/apache/spark/pull/40490#issuecomment-1478569986

Looks good to me. +1
[GitHub] [spark] dongjoon-hyun commented on pull request #40490: [SPARK-42536][BUILD] Upgrade log4j2 to 2.20.0
dongjoon-hyun commented on PR #40490: URL: https://github.com/apache/spark/pull/40490#issuecomment-1478567572

cc @viirya, too.
[GitHub] [spark] dongjoon-hyun closed pull request #40509: [SPARK-42885][K8S][BUILD] Upgrade `kubernetes-client` to 6.5.1
dongjoon-hyun closed pull request #40509: [SPARK-42885][K8S][BUILD] Upgrade `kubernetes-client` to 6.5.1 URL: https://github.com/apache/spark/pull/40509
[GitHub] [spark] dongjoon-hyun commented on pull request #40509: [SPARK-42885][K8S][BUILD] Upgrade `kubernetes-client` to 6.5.1
dongjoon-hyun commented on PR #40509: URL: https://github.com/apache/spark/pull/40509#issuecomment-1478564576 Thank you, @viirya . Merged to master for Apache Spark 3.5.
[GitHub] [spark] dongjoon-hyun commented on pull request #40440: [SPARK-42808][CORE] Avoid getting availableProcessors every time in `MapOutputTrackerMaster#getStatistics`
dongjoon-hyun commented on PR #40440: URL: https://github.com/apache/spark/pull/40440#issuecomment-1478522804 Thank you all!
[GitHub] [spark] dongjoon-hyun commented on pull request #40509: [SPARK-42885][K8S][BUILD] Upgrade `kubernetes-client` to 6.5.1
dongjoon-hyun commented on PR #40509: URL: https://github.com/apache/spark/pull/40509#issuecomment-1478454379 Thank you, @amaliujia . Also, could you review this PR, @viirya ?
[GitHub] [spark] peter-toth commented on pull request #40268: [SPARK-42500][SQL] ConstantPropagation support more cases
peter-toth commented on PR #40268: URL: https://github.com/apache/spark/pull/40268#issuecomment-1478292434 @cloud-fan, @wangyum please let me know if this PR needs further improvements.
[GitHub] [spark] simonvanderveldt commented on pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s
simonvanderveldt commented on PR #38518: URL: https://github.com/apache/spark/pull/38518#issuecomment-1478291017

> To be safe, could you revise this PR by adding a new internal configuration like KUBERNETES_EXECUTOR_ENABLE_API_WATCHER and use it with the following?

Not sure this makes sense. AFAIK the current behavior is a bug, so it should just be fixed. If there's uncertainty about the implementation of this fix, then IMHO that uncertainty should be addressed. For reference, the patch as it was before the config option was added is working fine for us; we run about 800 Spark apps a day.
[GitHub] [spark] amaliujia commented on pull request #40509: [SPARK-42885][K8S][BUILD] Upgrade `kubernetes-client` to 6.5.1
amaliujia commented on PR #40509: URL: https://github.com/apache/spark/pull/40509#issuecomment-1478279790 LGTM
[GitHub] [spark] amaliujia commented on pull request #40447: [SPARK-42816][CONNECT] Support Max Message size up to 128MB
amaliujia commented on PR #40447: URL: https://github.com/apache/spark/pull/40447#issuecomment-1478275510 LGTM
[GitHub] [spark] amaliujia commented on pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options
amaliujia commented on PR #40498: URL: https://github.com/apache/spark/pull/40498#issuecomment-1478228229 @hvanhovell oh are you thinking about golden file based tests?
[GitHub] [spark] amaliujia commented on pull request #40498: [SPARK-42878][CONNECT] The table API in DataFrameReader could also accept options
amaliujia commented on PR #40498: URL: https://github.com/apache/spark/pull/40498#issuecomment-1478220179 @hvanhovell the existing codebase uses this to verify the options:
```
test("SPARK-32844: DataFrameReader.table take the specified options for V1 relation") {
  withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
    withTable("t") {
      sql("CREATE TABLE t(i int, d double) USING parquet OPTIONS ('p1'='v1', 'p2'='v2')")
      val df = spark.read.option("P2", "v2").option("p3", "v3").table("t")
      val options = df.queryExecution.analyzed.collectFirst {
        case r: LogicalRelation => r.relation.asInstanceOf[HadoopFsRelation].options
      }.get
      assert(options("p2") == "v2")
      assert(options("p3") == "v3")
    }
  }
}
```
However this won't work for Spark Connect. Do you know if we have another way to achieve it?
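The semantics the Scala test above asserts can be illustrated in plain Python. This is a hypothetical sketch of the option-merging behavior under test (per-read options layered case-insensitively over the table's own options), not Spark code; `merge_options` is an invented helper for illustration only:

```python
def merge_options(table_options, read_options):
    """Merge read-time options over table options, case-insensitively,
    with the per-read options taking precedence on key collisions."""
    merged = {k.lower(): v for k, v in table_options.items()}
    merged.update({k.lower(): v for k, v in read_options.items()})
    return merged

# Mirrors the scenario in the test: the table declares p1/p2, the read
# supplies P2 (different case) and p3.
table_opts = {"p1": "v1", "p2": "v2"}
read_opts = {"P2": "v2", "p3": "v3"}
options = merge_options(table_opts, read_opts)
assert options["p2"] == "v2"
assert options["p3"] == "v3"
```

The open question in the comment stands: verifying this through `queryExecution.analyzed` is not available on the Spark Connect client side.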
[GitHub] [spark] srielau commented on a diff in pull request #40508: [MINOR][SQL][CONNECT][PYTHON] Clarify the comment of parameterized SQL args
srielau commented on code in PR #40508: URL: https://github.com/apache/spark/pull/40508#discussion_r1143689845 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala: ##
```
@@ -213,7 +213,8 @@ class SparkSession private[sql] (
    * @param sqlText
    *   A SQL statement with named parameters to execute.
    * @param args
-   *   A map of parameter names to literal values.
+   *   A map of parameter names to string values that are parsed as SQL literal expressions. For
+   *   example, map keys: "abc", "def_1", map values: "1", "'abc'", "DATE'2023-03-21'".
```
Review Comment: How about keys that clarify the type, at least implicitly? Also, the arity didn't match up.
```suggestion
   * example, map keys: "rank", "name", "birthdate"; map values: "1", "'Steven'", "DATE'2023-03-21'".
```
[GitHub] [spark] attilapiros commented on a diff in pull request #38518: [SPARK-33349][K8S] Reset the executor pods watcher when we receive a version changed from k8s
attilapiros commented on code in PR #38518: URL: https://github.com/apache/spark/pull/38518#discussion_r1143681244 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSource.scala: ##
```
@@ -45,7 +46,9 @@ class ExecutorPodsWatchSnapshotSource(
    conf: SparkConf) extends Logging {

  private var watchConnection: Closeable = _
```
Review Comment: The `Watcher` is running on a separate thread, right? And `watchConnection` is sometimes accessed from one thread via `ExecutorPodsWatchSnapshotSource#stop()` and sometimes from the watcher's thread via `ExecutorPodsWatchSnapshotSource#reset()`. So the methods where `watchConnection` is accessed (start(), stop(), and reset()) must be synchronized.
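The locking the comment above calls for can be sketched in Python. This is a hypothetical stand-in for the Scala class, not Spark's actual code; a single lock plays the role of marking the Scala methods `synchronized`:

```python
import threading

class WatchSource:
    """Sketch of a shared connection handle touched by two threads:
    stop() from the scheduler thread and reset() from the watcher's
    thread. Every access goes through one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._watch_connection = None

    def start(self, conn):
        with self._lock:
            self._watch_connection = conn

    def reset(self, conn):
        # Watcher thread: close the stale watch, then install its replacement.
        with self._lock:
            if self._watch_connection is not None:
                self._watch_connection.close()
            self._watch_connection = conn

    def stop(self):
        # Scheduler thread: tear down whatever watch is current.
        with self._lock:
            if self._watch_connection is not None:
                self._watch_connection.close()
                self._watch_connection = None

    def is_watching(self):
        with self._lock:
            return self._watch_connection is not None


class DummyConn:
    """Trivial closeable used to exercise the sketch."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True
```

Without the lock, stop() could close a connection at the same moment reset() is swapping it out, leaking the replacement watch; serializing all three methods on one monitor removes that race.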
[GitHub] [spark] unical1988 commented on pull request #40468: "[SPARK-42838][SQL] changed error class name _LEGACY_ERROR_TEMP_2000"
unical1988 commented on PR #40468: URL: https://github.com/apache/spark/pull/40468#issuecomment-1478174533

> Found JIRA: [SPARK-42838](https://issues.apache.org/jira/browse/SPARK-42838).
>
> Can you attach the JIRA number to title?
>
> e.g. "[[SPARK-42838](https://issues.apache.org/jira/browse/SPARK-42838)][SQL] changed error class name _LEGACY_ERROR_TEMP_2000"

@itholic done!
[GitHub] [spark] pan3793 commented on pull request #40444: [SPARK-42813][K8S] Print application info when waitAppCompletion is false
pan3793 commented on PR #40444: URL: https://github.com/apache/spark/pull/40444#issuecomment-1478154866 Thank you, @dongjoon-hyun and @yaooqinn