[GitHub] [spark] khalidmammadov commented on a diff in pull request #40916: [SPARK-43243][PYTHON][CONNECT] Add level param to printSchema for Python
khalidmammadov commented on code in PR #40916: URL: https://github.com/apache/spark/pull/40916#discussion_r1176026841 ## python/pyspark/sql/dataframe.py: ## @@ -584,14 +584,14 @@ def printSchema(self, level: Optional[int] = None) -> None: .. versionchanged:: 3.4.0 Supports Spark Connect. -.. versionchanged:: 3.5.0 -Added Level parameter. - Parameters -- level : int, optional, default None How many levels to print for nested schemas. +.. versionchanged:: 3.5.0 Review Comment: Good spot! fixed. Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module
cloud-fan commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1176020421 ## common/utils/src/main/scala/org/apache/spark/JsonProtocol.scala: ## @@ -0,0 +1,41 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import java.io.ByteArrayOutputStream +import java.nio.charset.StandardCharsets + +import com.fasterxml.jackson.core.{JsonEncoding, JsonGenerator} +import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper} +import com.fasterxml.jackson.module.scala.DefaultScalaModule + + +private[spark] object JsonProtocolUtil { Review Comment: the name could be just `JsonUtils`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module
cloud-fan commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1176020179 ## common/utils/src/main/scala/org/apache/spark/JsonProtocol.scala: ## @@ -0,0 +1,41 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark + +import java.io.ByteArrayOutputStream +import java.nio.charset.StandardCharsets + +import com.fasterxml.jackson.core.{JsonEncoding, JsonGenerator} +import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper} +import com.fasterxml.jackson.module.scala.DefaultScalaModule + + +private[spark] object JsonProtocolUtil { Review Comment: should this live in the `org.apache.spark.util` package? and can we call this util function in `org.apache.spark.util.JsonProtocol.toJsonString` to avoid duplicated code? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module
cloud-fan commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1176019274 ## common/utils/src/main/scala/org/apache/spark/SparkException.scala: ## @@ -18,14 +18,13 @@ package org.apache.spark import java.io.FileNotFoundException +import java.nio.file.FileAlreadyExistsException import java.sql.{SQLException, SQLFeatureNotSupportedException} import java.time.DateTimeException import java.util.ConcurrentModificationException import scala.collection.JavaConverters._ -import org.apache.hadoop.fs.FileAlreadyExistsException Review Comment: is this a breaking change? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module
amaliujia commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1176017207 ## project/MimaExcludes.scala: ## @@ -66,6 +66,12 @@ object MimaExcludes { ProblemFilters.exclude[Problem]("org.sparkproject.spark_core.protobuf.*"), ProblemFilters.exclude[Problem]("org.apache.spark.status.protobuf.StoreTypes*"), +// SPARK-43265: Move Error framework to a common utils module + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.QueryContext"), + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException"), + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException$"), + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkThrowable"), Review Comment: I don't think I renamed the package name. This throws from spark-core? Unless spark-core builds the uber jar and this checks the uber jar otherwise it should throw? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module
amaliujia commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1176017207 ## project/MimaExcludes.scala: ## @@ -66,6 +66,12 @@ object MimaExcludes { ProblemFilters.exclude[Problem]("org.sparkproject.spark_core.protobuf.*"), ProblemFilters.exclude[Problem]("org.apache.spark.status.protobuf.StoreTypes*"), +// SPARK-43265: Move Error framework to a common utils module + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.QueryContext"), + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException"), + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException$"), + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkThrowable"), Review Comment: I don't think I renamed the package name. This throws from spark-core? Unless spark-core builds the iber jar and this checks the uber jar otherwise it should throw? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module
cloud-fan commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1176015036 ## project/MimaExcludes.scala: ## @@ -66,6 +66,12 @@ object MimaExcludes { ProblemFilters.exclude[Problem]("org.sparkproject.spark_core.protobuf.*"), ProblemFilters.exclude[Problem]("org.apache.spark.status.protobuf.StoreTypes*"), +// SPARK-43265: Move Error framework to a common utils module + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.QueryContext"), + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException"), + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException$"), + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkThrowable"), Review Comment: Did we rename the package? Otherwise it should be still binary compatible. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #40931: [SPARK-43265] Move Error framework to a common utils module
amaliujia commented on PR #40931: URL: https://github.com/apache/spark/pull/40931#issuecomment-1521157761 `continuous-integration/appveyor/pr` seems to be stuck. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #40892: [SPARK-43128][CONNECT] Make `recentProgress` and `lastProgress` return `StreamingQueryProgress` consistent with the native Scal
LuciferYang commented on code in PR #40892: URL: https://github.com/apache/spark/pull/40892#discussion_r1175980270 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryProgressSuite.scala: ## @@ -0,0 +1,214 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.streaming + +import scala.jdk.CollectionConverters._ + +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema +import org.apache.spark.sql.connect.client.util.ConnectFunSuite +import org.apache.spark.sql.types.StructType + +class StreamingQueryProgressSuite extends ConnectFunSuite { + test("test seder StreamingQueryProgress from json") { +val jsonStringFromServerSide = + s""" + |{ + | "id" : "33ac26f4-1c39-46ce-b798-f3d2a21211d4", + | "runId" : "849c2c9a-b9f8-446f-9180-259a60fd888c", + | "name" : "myName", + | "timestamp" : "2016-12-05T20:54:20.827Z", + | "batchId" : 2, + | "batchDuration" : 0, + | "durationMs" : { + |"total" : 0 + | }, + | "eventTime" : { + |"min" : "2016-12-05T20:54:20.827Z", + |"avg" : "2016-12-05T20:54:20.827Z", + |"watermark" : "2016-12-05T20:54:20.827Z", + |"max" : "2016-12-05T20:54:20.827Z" + | }, + | "stateOperators" : [ { + |"operatorName" : "op1", + |"numRowsTotal" : 0, + |"numRowsUpdated" : 1, + |"allUpdatesTimeMs" : 1, + |"numRowsRemoved" : 2, + |"allRemovalsTimeMs" : 34, + |"commitTimeMs" : 23, + |"memoryUsedBytes" : 3, + |"numRowsDroppedByWatermark" : 0, + |"numShufflePartitions" : 2, + |"numStateStoreInstances" : 2, + |"customMetrics" : { + | "stateOnCurrentVersionSizeBytes" : 2, + | "loadedMapCacheHitCount" : 1, + | "loadedMapCacheMissCount" : 0 + |} + | } ], + | "sources" : [ { + |"description" : "source", + |"startOffset" : "123", + |"endOffset" : "456", + |"latestOffset" : "789", + |"numInputRows" : 678, + |"inputRowsPerSecond" : 10.0, + |"processedRowsPerSecond" : "Infinity", + |"metrics" : { } + | } ], + | "sink" : { + |"description" : "sink", + |"numOutputRows" : -1, + |"metrics" : { } + | }, + | "observedMetrics" : { + |"event1" : { + | "values" : [ 1, 3.0 ], + | "schema" : { + |"type" : "struct", + |"fields" : [ { + | "name" : "c1", + | "type" : "long", + | "nullable" : true, + | "metadata" : { } + |}, { + | "name" : "c2", + | "type" : "double", + | "nullable" : true, + | "metadata" : { } + |} ] + | } + |}, + |"event2" : { + | "values" : [ 1, "hello", "world" ], + | "schema" : { + |"type" : "struct", + |"fields" : [ { + | "name" : "rc", + | "type" : "long", + | "nullable" : true, + | "metadata" : { } + |}, { + | "name" : "min_q", + | "type" : "string", + | "nullable" : true, + | "metadata" : { } + |}, { + | "name" : "max_q", + | "type" : "string", + | "nullable" : true, + | "metadata" : { } + |} ] + | } + |} + | } + |} + """.stripMargin.trim + +// scalastyle:off +println(jsonStringFromServerSide) + +val result = StreamingQueryProgress.fromJson(jsonStringFromServerSide) +assert(result.id.toString === "33ac26f4-1c39-46ce-b798-f3d2a21211d4")
[GitHub] [spark] zhengruifeng commented on pull request #40927: [SPARK-42419][FOLLOWUP][CONNECT][PYTHON] Remove unused exception
zhengruifeng commented on PR #40927: URL: https://github.com/apache/spark/pull/40927#issuecomment-152978 oh, the linter fails -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40617: [SPARK-42992][PYTHON] Introduce PySparkRuntimeError
zhengruifeng commented on code in PR #40617: URL: https://github.com/apache/spark/pull/40617#discussion_r1175974399 ## python/pyspark/errors/error_classes.py: ## @@ -79,11 +114,26 @@ "returnType can not be specified when `` is a user-defined function, but got ." ] }, + "CANNOT_UNPERSIST_BROADCAST": { +"message": [ + "Broadcast can only be unpersisted in driver." +] + }, "COLUMN_IN_LIST": { "message": [ "`` does not allow a Column in a list." ] }, + "CONTEXT_ONLY_VALID_ON_DRIVER" : { +"message" : [ + "It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063." +] + }, + "CONTEXT_UNAVAILABLE_FOR_REMOTE_CLIENT" : { +"message" : [ + "Remote client cannot create a SparkContext. Create SparkSession instead." Review Comment: NVM, I feel it says that `Remote client cannot create a SparkContext. So the SparkConnect Python Client create a session instead`, while it actually raise a error and fails. Not a big deal. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #40936: [MINOR][CONNECT] Remove unnecessary creation of `planner` in `handleWriteOperation` and `handleWriteOperationV2`
zhengruifeng commented on PR #40936: URL: https://github.com/apache/spark/pull/40936#issuecomment-1521106087 thanks @LuciferYang for review, merged into master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng closed pull request #40936: [MINOR][CONNECT] Remove unnecessary creation of `planner` in `handleWriteOperation` and `handleWriteOperationV2`
zhengruifeng closed pull request #40936: [MINOR][CONNECT] Remove unnecessary creation of `planner` in `handleWriteOperation` and `handleWriteOperationV2` URL: https://github.com/apache/spark/pull/40936 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bogao007 commented on a diff in pull request #40892: [SPARK-43128][CONNECT] Make `recentProgress` and `lastProgress` return `StreamingQueryProgress` consistent with the native Scala A
bogao007 commented on code in PR #40892: URL: https://github.com/apache/spark/pull/40892#discussion_r1175960363 ## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryProgressSuite.scala: ## @@ -0,0 +1,214 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.streaming + +import scala.jdk.CollectionConverters._ + +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema +import org.apache.spark.sql.connect.client.util.ConnectFunSuite +import org.apache.spark.sql.types.StructType + +class StreamingQueryProgressSuite extends ConnectFunSuite { + test("test seder StreamingQueryProgress from json") { +val jsonStringFromServerSide = + s""" + |{ + | "id" : "33ac26f4-1c39-46ce-b798-f3d2a21211d4", + | "runId" : "849c2c9a-b9f8-446f-9180-259a60fd888c", + | "name" : "myName", + | "timestamp" : "2016-12-05T20:54:20.827Z", + | "batchId" : 2, + | "batchDuration" : 0, + | "durationMs" : { + |"total" : 0 + | }, + | "eventTime" : { + |"min" : "2016-12-05T20:54:20.827Z", + |"avg" : "2016-12-05T20:54:20.827Z", + |"watermark" : "2016-12-05T20:54:20.827Z", + |"max" : "2016-12-05T20:54:20.827Z" + | }, + | "stateOperators" : [ { + |"operatorName" : "op1", + |"numRowsTotal" : 0, + |"numRowsUpdated" : 1, + |"allUpdatesTimeMs" : 1, + |"numRowsRemoved" : 2, + |"allRemovalsTimeMs" : 34, + |"commitTimeMs" : 23, + |"memoryUsedBytes" : 3, + |"numRowsDroppedByWatermark" : 0, + |"numShufflePartitions" : 2, + |"numStateStoreInstances" : 2, + |"customMetrics" : { + | "stateOnCurrentVersionSizeBytes" : 2, + | "loadedMapCacheHitCount" : 1, + | "loadedMapCacheMissCount" : 0 + |} + | } ], + | "sources" : [ { + |"description" : "source", + |"startOffset" : "123", + |"endOffset" : "456", + |"latestOffset" : "789", + |"numInputRows" : 678, + |"inputRowsPerSecond" : 10.0, + |"processedRowsPerSecond" : "Infinity", + |"metrics" : { } + | } ], + | "sink" : { + |"description" : "sink", + |"numOutputRows" : -1, + |"metrics" : { } + | }, + | "observedMetrics" : { + |"event1" : { + | "values" : [ 1, 3.0 ], + | "schema" : { + |"type" : "struct", + |"fields" : [ { + | "name" : "c1", + | "type" : "long", + | "nullable" : true, + | "metadata" : { } + |}, { + | "name" : "c2", + | "type" : "double", + | "nullable" : true, + | "metadata" : { } + |} ] + | } + |}, + |"event2" : { + | "values" : [ 1, "hello", "world" ], + | "schema" : { + |"type" : "struct", + |"fields" : [ { + | "name" : "rc", + | "type" : "long", + | "nullable" : true, + | "metadata" : { } + |}, { + | "name" : "min_q", + | "type" : "string", + | "nullable" : true, + | "metadata" : { } + |}, { + | "name" : "max_q", + | "type" : "string", + | "nullable" : true, + | "metadata" : { } + |} ] + | } + |} + | } + |} + """.stripMargin.trim + +// scalastyle:off +println(jsonStringFromServerSide) + +val result = StreamingQueryProgress.fromJson(jsonStringFromServerSide) +assert(result.id.toString === "33ac26f4-1c39-46ce-b798-f3d2a21211d4") +
[GitHub] [spark] itholic commented on pull request #40617: [SPARK-42992][PYTHON] Introduce PySparkRuntimeError
itholic commented on PR #40617: URL: https://github.com/apache/spark/pull/40617#issuecomment-1521086780 Applied the comments, thanks @zhengruifeng ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #40617: [SPARK-42992][PYTHON] Introduce PySparkRuntimeError
itholic commented on code in PR #40617: URL: https://github.com/apache/spark/pull/40617#discussion_r1175958890 ## python/pyspark/errors/error_classes.py: ## @@ -79,11 +114,26 @@ "returnType can not be specified when `` is a user-defined function, but got ." ] }, + "CANNOT_UNPERSIST_BROADCAST": { +"message": [ + "Broadcast can only be unpersisted in driver." +] + }, "COLUMN_IN_LIST": { "message": [ "`` does not allow a Column in a list." ] }, + "CONTEXT_ONLY_VALID_ON_DRIVER" : { +"message" : [ + "It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063." +] + }, + "CONTEXT_UNAVAILABLE_FOR_REMOTE_CLIENT" : { +"message" : [ + "Remote client cannot create a SparkContext. Create SparkSession instead." Review Comment: Sorry, I didn't catch the point. The message suggests using a Spark session because it is not possible to create a SparkContext in a remote session. Do you have any proposed suggestion? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #40430: [SPARK-42798][BUILD] Upgrade protobuf-java to 3.22.3
LuciferYang commented on PR #40430: URL: https://github.com/apache/spark/pull/40430#issuecomment-1521078012 All test passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #40617: [SPARK-42992][PYTHON] Introduce PySparkRuntimeError
itholic commented on code in PR #40617: URL: https://github.com/apache/spark/pull/40617#discussion_r1175953882 ## python/pyspark/errors/error_classes.py: ## @@ -79,11 +114,26 @@ "returnType can not be specified when `` is a user-defined function, but got ." ] }, + "CANNOT_UNPERSIST_BROADCAST": { Review Comment: Sounds good. Seems like we can consolidate `CANNOT_DESTROY_BROADCAST` as well :-) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #40738: [SPARK-42380][BUILD] Upgrade Apache Maven to 3.9.1
LuciferYang commented on PR #40738: URL: https://github.com/apache/spark/pull/40738#issuecomment-1521073578 @steveloughran Do you have any other questions? I am planning to close this PR and test Apache Maven 3.9.2+ in the future to check if it is possible not to add `-Dmaven.resolver.transport=wagon`. Apache Spark can continue to use 3.8.7 now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on pull request #40893: [SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution
pan3793 commented on PR #40893: URL: https://github.com/apache/spark/pull/40893#issuecomment-1521073105 > OK, am I right that this does not make Spark any _less_ compatible with any version of Hive that is currently supported (>= 2.3.9)? If so then this is fine Yes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] melin commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics
melin commented on PR #38496: URL: https://github.com/apache/spark/pull/38496#issuecomment-1521072841 > @melin I have opened another pr #39114 for this, and unfortunately that one was closed because of no feedback for a long time. If necessary, I can consider reopening it. reopen it,thank you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #40847: [SPARK-43185][BUILD] Inline `hadoop-client` related properties in `pom.xml`
LuciferYang commented on PR #40847: URL: https://github.com/apache/spark/pull/40847#issuecomment-1521071604 > Interesting, thanks for the detailed analysis @LuciferYang ! > > > Use hadoop 3.2.x can't build hadoop-cloud module too > > This is Hadoop 3.2.2 ? I remember at some point we started to enable `hadoop-cloud` in Spark release, so I wonder why this didn't cause any error back in the time .. I test hadoop 3.2.4. `AbortableStreamBasedCheckpointFileManager` was introduced in SPARK-40039, and it uses APIs that are only available in Hadoop 3.3.1+([HADOOP-16906](https://issues.apache.org/jira/browse/HADOOP-16906) `FSDataOutputStream#abort()`) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on pull request #40927: [SPARK-42419][FOLLOWUP][CONNECT][PYTHON] Remove unused exception
itholic commented on PR #40927: URL: https://github.com/apache/spark/pull/40927#issuecomment-1521055105 > what about changing to? @zhengruifeng Nice! Just applied the comment -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on pull request #40722: [SPARK-43076][PS][CONNECT] Removing the dependency on `grpcio` when remote session is not used.
itholic commented on PR #40722: URL: https://github.com/apache/spark/pull/40722#issuecomment-1521051461 @HyukjinKwon Test passed. There is anything else we should address here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #40922: [SPARK-43063][SQL][FOLLOWUP] Add ToPrettyString expression for Dataset.show
cloud-fan commented on PR #40922: URL: https://github.com/apache/spark/pull/40922#issuecomment-1521043238 > would it be possible to not make any changes in Cast and do everything in df.show method? We can by duplicating the code of `Cast`, but I don't think that's a good idea. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] jackylee-ch commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics
jackylee-ch commented on PR #38496: URL: https://github.com/apache/spark/pull/38496#issuecomment-1521041727 @melin I have opened another pr #39114 for this, and unfortunately that one was closed because of no feedback for a long time. If necessary, I can consider reopening it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WweiL commented on pull request #40906: [SPARK-43134] [CONNECT] [SS] JVM client StreamingQuery exception() API
WweiL commented on PR #40906: URL: https://github.com/apache/spark/pull/40906#issuecomment-1521041159 > CheckConnectJvmClientCompatibility Ah thanks! Removed it! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng opened a new pull request, #40936: [MINOR][CONNECT] Remove unnecessary creation of `planner` in `handleWriteOperation` and `handleWriteOperationV2`
zhengruifeng opened a new pull request, #40936: URL: https://github.com/apache/spark/pull/40936 ### What changes were proposed in this pull request? Remove unnecessary creation of `planner` in `handleWriteOperation` and `handleWriteOperationV2` ### Why are the changes needed? `handleWriteOperation` and `handleWriteOperationV2` themselves are the methods of planner, no need to create another planner to call `transformRelation` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing UTs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #40928: [SPARK-43262][CONNECT][SS][PYTHON] Migrate Spark Connect Structured Streaming errors into error class
zhengruifeng commented on PR #40928: URL: https://github.com/apache/spark/pull/40928#issuecomment-1521036530 merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng closed pull request #40928: [SPARK-43262][CONNECT][SS][PYTHON] Migrate Spark Connect Structured Streaming errors into error class
zhengruifeng closed pull request #40928: [SPARK-43262][CONNECT][SS][PYTHON] Migrate Spark Connect Structured Streaming errors into error class URL: https://github.com/apache/spark/pull/40928 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40617: [SPARK-42992][PYTHON] Introduce PySparkRuntimeError
zhengruifeng commented on code in PR #40617: URL: https://github.com/apache/spark/pull/40617#discussion_r1175926004 ## python/pyspark/errors/error_classes.py: ## @@ -79,11 +114,26 @@ "returnType can not be specified when `` is a user-defined function, but got ." ] }, + "CANNOT_UNPERSIST_BROADCAST": { Review Comment: is it possible to combine `CANNOT_UNPERSIST_BROADCAST` `CANNOT_REDUCE_BROADCAST` and `CANNOT_DESTROY_BROADCAST` to a single `INVALID_BROADCAST_OPERATION` with parameter `operation` ## python/pyspark/errors/error_classes.py: ## @@ -29,6 +34,21 @@ "Attribute `` in provided object `` is not callable." ] }, + "BARRIER_TASK_CONTEXT_NOT_INITIALIZE": { +"message": [ + "Not supported to call `` before initialize BarrierTaskContext." +] + }, + "BROADCAST_VARIABLE_NOT_LOADED": { +"message": [ + "Broadcast variable `` not loaded." +] + }, + "CALL_BEFORE_SESSION_INITIALIZE" : { Review Comment: what about combine `BARRIER_TASK_CONTEXT_NOT_INITIALIZE` and `CALL_BEFORE_SESSION_INITIALIZE` to a general `CALL_BEFORE_INITIALIZE` with an extra parameter `object`? ## python/pyspark/errors/error_classes.py: ## @@ -79,11 +114,26 @@ "returnType can not be specified when `` is a user-defined function, but got ." ] }, + "CANNOT_UNPERSIST_BROADCAST": { +"message": [ + "Broadcast can only be unpersisted in driver." +] + }, "COLUMN_IN_LIST": { "message": [ "`` does not allow a Column in a list." ] }, + "CONTEXT_ONLY_VALID_ON_DRIVER" : { +"message" : [ + "It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063." +] + }, + "CONTEXT_UNAVAILABLE_FOR_REMOTE_CLIENT" : { +"message" : [ + "Remote client cannot create a SparkContext. Create SparkSession instead." Review Comment: I think the message is misleading, it implies a `SparkSession` was created. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] melin commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics
melin commented on PR #38496: URL: https://github.com/apache/spark/pull/38496#issuecomment-1521033489 > > Support partition statistics? > > @melin I'm working on the supporting of partition statistics update, it relies on workers to return detailed partition statistics. Is it done? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #40665: [SPARK-42621][PS] Add inclusive parameter for pd.date_range
itholic commented on code in PR #40665: URL: https://github.com/apache/spark/pull/40665#discussion_r1175922744 ## python/pyspark/pandas/tests/test_namespace.py: ## @@ -221,13 +221,13 @@ def test_date_range(self): ) self.assert_eq( -ps.date_range(start="2017-01-01", end="2017-01-04", closed="left"), -pd.date_range(start="2017-01-01", end="2017-01-04", closed="left"), +ps.date_range(start="2017-01-01", end="2017-01-04", inclusive="left"), +pd.date_range(start="2017-01-01", end="2017-01-04", inclusive="left"), ) self.assert_eq( -ps.date_range(start="2017-01-01", end="2017-01-04", closed="right"), -pd.date_range(start="2017-01-01", end="2017-01-04", closed="right"), +ps.date_range(start="2017-01-01", end="2017-01-04", inclusive="right"), +pd.date_range(start="2017-01-01", end="2017-01-04", inclusive="right"), Review Comment: Can we also have a tests for "both", "neither" and negative case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #40927: [SPARK-42419][FOLLOWUP][CONNECT][PYTHON] Remove unused exception
zhengruifeng commented on PR #40927: URL: https://github.com/apache/spark/pull/40927#issuecomment-1521023572 to be more consistent with https://github.com/apache/spark/blob/a45affe3c8e7a724aea7dbbc1af08e36001c7540/python/pyspark/sql/column.py#L924-L932 what about changing to? ``` if isinstance(length, Column): length_expr = length._expr start_expr = startPos._expr elif isinstance(length, int): length_expr = LiteralExpression._from_value(length) start_expr = LiteralExpression._from_value(startPos) else: raise PySparkTypeError( error_class="NOT_COLUMN_OR_INT", message_parameters={"arg_name": "length", "arg_type": type(length).__name__}, ) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bogao007 commented on a diff in pull request #40906: [SPARK-43134] [CONNECT] [SS] JVM client StreamingQuery exception() API
bogao007 commented on code in PR #40906: URL: https://github.com/apache/spark/pull/40906#discussion_r1175914088 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala: ## @@ -199,6 +205,26 @@ class RemoteStreamingQuery( // scalastyle:on println } + override def exception: Option[StreamingQueryException] = { +val exception = executeQueryCmd(_.setException(true)).getException +if (exception.hasExceptionMessage) { + // TODO(SPARK-43206): Add more information to StreamingQueryException. + Some( +new StreamingQueryException( + "", + cause = null, + "", + "", + errorClass = "STREAM_FAILED", Review Comment: I see, thanks for the clarification! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on pull request #40893: [SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution
srowen commented on PR #40893: URL: https://github.com/apache/spark/pull/40893#issuecomment-1521012401 OK, am I right that this does not make Spark any _less_ compatible with any version of Hive that is currently supported (>= 2.3.9)? If so then this is fine -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WweiL commented on a diff in pull request #40906: [SPARK-43134] [CONNECT] [SS] JVM client StreamingQuery exception() API
WweiL commented on code in PR #40906: URL: https://github.com/apache/spark/pull/40906#discussion_r1175911789 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala: ## @@ -199,6 +205,26 @@ class RemoteStreamingQuery( // scalastyle:on println } + override def exception: Option[StreamingQueryException] = { +val exception = executeQueryCmd(_.setException(true)).getException +if (exception.hasExceptionMessage) { + // TODO(SPARK-43206): Add more information to StreamingQueryException. + Some( +new StreamingQueryException( + "", + cause = null, + "", + "", + errorClass = "STREAM_FAILED", Review Comment: I believe it's not yet... The general error handling framework for jvm client is not ready yet. And they only include error message and stack trace so far. I'm thinking to implement our own proto message to also include the offsets, which will be added in next PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] aokolnychyi commented on pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message
aokolnychyi commented on PR #40934: URL: https://github.com/apache/spark/pull/40934#issuecomment-1521009283 Thanks for reviewing, @gengliangwang @sunchao @huaxingao @viirya! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40916: [SPARK-43243][PYTHON][CONNECT] Add level param to printSchema for Python
zhengruifeng commented on code in PR #40916: URL: https://github.com/apache/spark/pull/40916#discussion_r1175909780 ## python/pyspark/sql/dataframe.py: ## @@ -584,14 +584,14 @@ def printSchema(self, level: Optional[int] = None) -> None: .. versionchanged:: 3.4.0 Supports Spark Connect. -.. versionchanged:: 3.5.0 -Added Level parameter. - Parameters -- level : int, optional, default None How many levels to print for nested schemas. +.. versionchanged:: 3.5.0 Review Comment: I think the indent is not right, you may refer to https://github.com/apache/spark/blob/7f724c3bc7567b0cddc09d5bed11b79879533368/python/pyspark/sql/dataframe.py#L1917-L1921 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bogao007 commented on a diff in pull request #40906: [SPARK-43134] [CONNECT] [SS] JVM client StreamingQuery exception() API
bogao007 commented on code in PR #40906: URL: https://github.com/apache/spark/pull/40906#discussion_r1175907637 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala: ## @@ -199,6 +205,26 @@ class RemoteStreamingQuery( // scalastyle:on println } + override def exception: Option[StreamingQueryException] = { +val exception = executeQueryCmd(_.setException(true)).getException +if (exception.hasExceptionMessage) { + // TODO(SPARK-43206): Add more information to StreamingQueryException. + Some( +new StreamingQueryException( + "", + cause = null, + "", + "", + errorClass = "STREAM_FAILED", Review Comment: Just curious, is the existing error framework accessible in Spark Connect jvm client? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #40905: [SPARK-43233] [SS] Add logging for Kafka Batch Reading for topic partition, offset range and task ID
HeartSaVioR commented on PR #40905: URL: https://github.com/apache/spark/pull/40905#issuecomment-1520997483 @siying Thanks for your first contribution to Apache Spark project. Welcome! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR closed pull request #40905: [SPARK-43233] [SS] Add logging for Kafka Batch Reading for topic partition, offset range and task ID
HeartSaVioR closed pull request #40905: [SPARK-43233] [SS] Add logging for Kafka Batch Reading for topic partition, offset range and task ID URL: https://github.com/apache/spark/pull/40905 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on pull request #40610: [SPARK-42626][CONNECT] Add Destructive Iterator for SparkResult
ivoson commented on PR #40610: URL: https://github.com/apache/spark/pull/40610#issuecomment-1520993028 Hi @hvanhovell , could you please take a look at this PR? Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #40931: [SPARK-43265] Move Error framework to a common utils module
amaliujia commented on PR #40931: URL: https://github.com/apache/spark/pull/40931#issuecomment-1520990871 ok looks like we also have compatibility checks on Spark-core ``` [error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.12:3.4.0! Found 4 potential problems (filtered 4036) [error] * interface org.apache.spark.QueryContext does not have a correspondent in current version [error]filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.QueryContext") [error] * class org.apache.spark.SparkException does not have a correspondent in current version [error]filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException") [error] * object org.apache.spark.SparkException does not have a correspondent in current version [error]filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException$") [error] * interface org.apache.spark.SparkThrowable does not have a correspondent in current version [error]filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkThrowable") ``` Are we ok to update the check? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sunchao commented on pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message
sunchao commented on PR #40934: URL: https://github.com/apache/spark/pull/40934#issuecomment-1520990636 Merged to master, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sunchao closed pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message
sunchao closed pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message URL: https://github.com/apache/spark/pull/40934 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics
github-actions[bot] commented on PR #38496: URL: https://github.com/apache/spark/pull/38496#issuecomment-1520984846 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] aokolnychyi commented on pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message
aokolnychyi commented on PR #40934: URL: https://github.com/apache/spark/pull/40934#issuecomment-1520982569 The error messages don't seem related. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #40931: [SPARK-43265] Move Error framework to a common utils module
amaliujia commented on PR #40931: URL: https://github.com/apache/spark/pull/40931#issuecomment-1520961265 Trying to exclude the MIMA check for `common-utils`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #40905: [SPARK-43233] [SS] Add logging for Kafka Batch Reading for topic partition, offset range and task ID
HeartSaVioR commented on PR #40905: URL: https://github.com/apache/spark/pull/40905#issuecomment-1520959155 Thanks! Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #40922: [SPARK-43063][SQL][FOLLOWUP] Add ToPrettyString expression for Dataset.show
sadikovi commented on PR #40922: URL: https://github.com/apache/spark/pull/40922#issuecomment-1520954545 I think the PR still introduces user-facing changes. Also, would it be possible to not make any changes in Cast and do everything in `df.show` method? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sunchao commented on pull request #40847: [SPARK-43185][BUILD] Inline `hadoop-client` related properties in `pom.xml`
sunchao commented on PR #40847: URL: https://github.com/apache/spark/pull/40847#issuecomment-1520953123 Interesting, thanks for the detailed analysis @LuciferYang ! > Use hadoop 3.2.x can't build hadoop-cloud module too This is Hadoop 3.2.2 ? I remember at some point we started to enable `hadoop-cloud` in Spark release, so I wonder why this didn't cause any error back in the time .. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] BeishaoCao-db commented on a diff in pull request #40907: [PYTHON] Implement `__dir__()` in `pyspark.sql.dataframe.DataFrame` to include columns
BeishaoCao-db commented on code in PR #40907: URL: https://github.com/apache/spark/pull/40907#discussion_r1175813851 ## python/pyspark/sql/dataframe.py: ## @@ -3008,6 +3008,25 @@ def __getattr__(self, name: str) -> Column: jc = self._jdf.apply(name) return Column(jc) +def __dir__(self): +""" +Examples + Review Comment: added -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] BeishaoCao-db commented on pull request #40907: [PYTHON] Implement `__dir__()` in `pyspark.sql.dataframe.DataFrame` to include columns
BeishaoCao-db commented on PR #40907: URL: https://github.com/apache/spark/pull/40907#issuecomment-1520825250 > engine that uses `dir()` to generate autocomplete suggestions (e.g. IPython kernel, Databricks Notebooks) will suggest column names on the completion `df.|` Sure, create one ASF JIRA: https://issues.apache.org/jira/browse/SPARK-43270 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WweiL opened a new pull request, #40935: [SPARK-43206] [SS] [CONNECT] [DRAFT] [DO-NOT-REVIEW] StreamingQuery exception() include stack trace
WweiL opened a new pull request, #40935: URL: https://github.com/apache/spark/pull/40935 ### What changes were proposed in this pull request? Add stack trace to streamingQuery's `exception()` method. Following https://github.com/apache/spark/commit/a5c8a3c976889f33595ac18f82e73e6b9fd29b57#diff-98baf452f0352c75a39f39351c5f9e656675810b6d4cfd178f1b0bae9751495b Add to both python client and scala client ### Why are the changes needed? Including stack trace is helpful in debugging ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested: 1. Python: ``` JVM stacktrace: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (ip-10-110-19-234.us-west-2.compute.internal executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/wei.liu/oss-spark/python/lib/pyspark.zip/pyspark/worker.py", line 850, in main process() ``` 2. Scala: TODO -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] aokolnychyi commented on a diff in pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message
aokolnychyi commented on code in PR #40934: URL: https://github.com/apache/spark/pull/40934#discussion_r1175744775 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala: ## @@ -56,7 +56,10 @@ class NamespaceAlreadyExistsException private( } def this(message: String) = { -this(message, errorClass = None, messageParameters = Map.empty[String, String]) +this( + message, + errorClass = Some("SCHEMA_ALREADY_EXISTS"), Review Comment: It is a bit unfortunate we have to use the same string literal in multiple places. Adding a constant would require a companion object and an import as these values are accessed in constructors. I was not sure it would be worth it so kept as is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] aokolnychyi commented on pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message
aokolnychyi commented on PR #40934: URL: https://github.com/apache/spark/pull/40934#issuecomment-1520774365 cc @gengliangwang @dongjoon-hyun @viirya @huaxingao @sunchao @cloud-fan @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] aokolnychyi opened a new pull request, #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message
aokolnychyi opened a new pull request, #40934: URL: https://github.com/apache/spark/pull/40934 ### What changes were proposed in this pull request? This PR makes sure each exception affected in PR #40679 has a proper error class when constructed with an explicit message. ### Why are the changes needed? These changes are needed per [this discussion](https://github.com/apache/spark/pull/40679#discussion_r1159264585). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pjfanning commented on pull request #40933: [SPARK-43263][BUILD] Upgrade `FasterXML jackson` to 2.15.0
pjfanning commented on PR #40933: URL: https://github.com/apache/spark/pull/40933#issuecomment-1520686760 if you use jackson 2.14.2 - you can just upgrade snakeyaml to 2.x jackson 2.15.0 has extra changes that you need to worry about - see https://issues.apache.org/jira/browse/SPARK-42854 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bjornjorgensen commented on pull request #40933: [SPARK-43263][BUILD] Upgrade `FasterXML jackson` to 2.15.0
bjornjorgensen commented on PR #40933: URL: https://github.com/apache/spark/pull/40933#issuecomment-1520673919 CC @pjfanning -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bjornjorgensen opened a new pull request, #40933: [SPARK-43263][BUILD] Upgrade `FasterXML jackson` to 2.15.0
bjornjorgensen opened a new pull request, #40933: URL: https://github.com/apache/spark/pull/40933 ### What changes were proposed in this pull request? Upgrade FasterXML jackson from 2.14.2 to 2.15.0 ### Why are the changes needed? Upgrade Snakeyaml to 2.0 (resolves CVE-2022-1471 [CVE-2022-1471 at nist](https://nvd.nist.gov/vuln/detail/CVE-2022-1471) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GA -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #40931: [SPARK-43265] Move Error framework to a common utils module
amaliujia commented on PR #40931: URL: https://github.com/apache/spark/pull/40931#issuecomment-1520666866 This PR fails on ``` Error: running /home/runner/work/spark/spark/dev/mima -Phadoop-3 -Pdocker-integration-tests -Pyarn -Pkubernetes -Pspark-ganglia-lgpl -Pmesos -Pconnect -Phive -Phadoop-cloud -Pkinesis-asl -Phive-thriftserver -Pvolcano ; received return code 1 ``` but there is no error messages about how to fix. I run this locally which output nothing as well. Stuck on mima stuff now, -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WweiL commented on pull request #40906: [SPARK-43134] [CONNECT] [SS] JVM client StreamingQuery exception() API
WweiL commented on PR #40906: URL: https://github.com/apache/spark/pull/40906#issuecomment-1520665915 Stack trace will be added in this ticket https://issues.apache.org/jira/browse/SPARK-43206, after this PR is merged -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WweiL commented on pull request #40887: [SPARK-43144] Scala Client DataStreamReader table() API
WweiL commented on PR #40887: URL: https://github.com/apache/spark/pull/40887#issuecomment-1520622334 @amaliujia @HyukjinKwon I also changed `ProtoToParsedPlanTestSuite` a little to remove the memory addresses, before the change the test for streaming table only would fail with: ``` - streaming_table_API_with_options *** FAILED *** (8 milliseconds) [info] Expected and actual plans do not match: [info] [info] === Expected Plan === [info] SubqueryAlias primary.tempdb.myStreamingTable [info] +- StreamingRelationV2 primary.tempdb.myStreamingTable, org.apache.spark.sql.connector.catalog.InMemoryTable@752725d9, [p1=v1, p2=v2], [id#0L], org.apache.spark.sql.connector.catalog.InMemoryCatalog@347d8e2a, tempdb.myStreamingTable [info] [info] [info] === Actual Plan === [info] SubqueryAlias primary.tempdb.myStreamingTable [info] +- StreamingRelationV2 primary.tempdb.myStreamingTable, org.apache.spark.sql.connector.catalog.InMemoryTable@a88a5db, [p1=v1, p2=v2], [id#0L], org.apache.spark.sql.connector.catalog.InMemoryCatalog@2c6b362e, tempdb.myStreamingTable ``` Because the memory address is different every time it runs. I removed these in the test suite. And verified that memory addresses doesn't exist in existing explain files: ``` wei.liu:~/oss-spark$ cat connector/connect/common/src/test/resources/query-tests/explain-results/* | grep @ wei.liu:~/oss-spark$ ``` PTAL, thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata
ryan-johnson-databricks commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175623971 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technically, a file format could choose suppress them, but that is not recommended. */ def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS + + /** + * The extractors to use when deriving file-constant metadata columns for this file format. + * + * By default, the value of a file-constant metadata column is obtained by looking up the column's + * name in the file's metadata column value map. However, implementations can override this method + * in order to provide an extractor that has access to the entire [[PartitionedFile]] when + * deriving the column's value. + * + * NOTE: Extractors are lazy, invoked only if the query actually selects their column at runtime. + * + * See also [[FileFormat.getFileConstantMetadataColumnValue]]. + */ + def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] = Review Comment: doc comments updated, PTAL -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #40899: [SPARK-43249][CONNECT] Fix missing stats for SQL Command
amaliujia commented on PR #40899: URL: https://github.com/apache/spark/pull/40899#issuecomment-1520596794 Thanks for adding the JIRA! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sunchao commented on pull request #40893: [SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution
sunchao commented on PR #40893: URL: https://github.com/apache/spark/pull/40893#issuecomment-1520573329 @pan3793 AFAIK the development efforts in Hive community are only in Hive 3.x/4.x at the moment, and the 2.x branch is barely maintained. I can try to start a conversation in the Hive community to have a new 2.3.10 release and see how it looks like. From the long term perspective, it'd be better for Spark to move to Hive 3.x/4.x. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sunchao commented on pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice in vectorized reader
sunchao commented on PR #39950: URL: https://github.com/apache/spark/pull/39950#issuecomment-1520563772 Yea @yabola is correct, if we have 100 row groups in a file and there are 100 tasks to read them, each task will only be assigned a range (e.g., a single row group) in the file to read, so it won't read metadata for all the row groups in the file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dzhigimont commented on a diff in pull request #40665: [SPARK-42621][PS] Add inclusive parameter for pd.date_range
dzhigimont commented on code in PR #40665: URL: https://github.com/apache/spark/pull/40665#discussion_r1175559912 ## python/pyspark/pandas/namespace.py: ## @@ -1782,12 +1780,8 @@ def date_range( Normalize start/end dates to midnight before generating date range. name : str, default None Name of the resulting DatetimeIndex. -closed : {None, 'left', 'right'}, optional -Make the interval closed with respect to the given frequency to -the 'left', 'right', or both sides (None, the default). - -.. deprecated:: 3.4.0 - +inclusive : {"both", "neither", "left", "right"}, default "both" Review Comment: Sure, versionadded has been added -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] DerekTBrown closed pull request #40798: SPARK-43166: name docker users
DerekTBrown closed pull request #40798: SPARK-43166: name docker users URL: https://github.com/apache/spark/pull/40798 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] DerekTBrown commented on pull request #40798: SPARK-43166: name docker users
DerekTBrown commented on PR #40798: URL: https://github.com/apache/spark/pull/40798#issuecomment-1520512047 Looks good. Closing in favor of #40831 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dzhigimont commented on a diff in pull request #40370: [SPARK-42620][PS] Add `inclusive` parameter for (DataFrame|Series).between_time
dzhigimont commented on code in PR #40370: URL: https://github.com/apache/spark/pull/40370#discussion_r1175552369 ## python/pyspark/pandas/frame.py: ## @@ -3582,14 +3571,18 @@ def between_time( if not isinstance(self.index, ps.DatetimeIndex): raise TypeError("Index must be DatetimeIndex") +if inclusive not in ["left", "right", "both", "neither"]: +msg = "Inclusive has to be either 'both', 'neither', 'left' or 'right'" +raise ValueError(msg) Review Comment: Changed to PySparkValueError and the new error_class has been added -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dzhigimont commented on a diff in pull request #40370: [SPARK-42620][PS] Add `inclusive` parameter for (DataFrame|Series).between_time
dzhigimont commented on code in PR #40370: URL: https://github.com/apache/spark/pull/40370#discussion_r1175551612 ## python/pyspark/pandas/frame.py: ## @@ -3519,16 +3516,8 @@ def between_time( Initial time as a time filter limit. end_time : datetime.time or str End time as a time filter limit. -include_start : bool, default True -Whether the start time needs to be included in the result. - -.. deprecated:: 3.4.0 - -include_end : bool, default True -Whether the end time needs to be included in the result. - -.. deprecated:: 3.4.0 - +inclusive : {"both", "neither", "left", "right"}, default "both" +Include boundaries; whether to set each bound as closed or open. Review Comment: Added ## python/pyspark/pandas/series.py: ## @@ -6836,16 +6833,8 @@ def between_time( Initial time as a time filter limit. end_time : datetime.time or str End time as a time filter limit. -include_start : bool, default True -Whether the start time needs to be included in the result. - -.. deprecated:: 3.4.0 - -include_end : bool, default True -Whether the end time needs to be included in the result. - -.. deprecated:: 3.4.0 - +inclusive : {"both", "neither", "left", "right"}, default "both" +Include boundaries; whether to set each bound as closed or open. Review Comment: Added -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] RyanBerti commented on pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch
RyanBerti commented on PR #40615: URL: https://github.com/apache/spark/pull/40615#issuecomment-1520502797 >about adding a third boolean argument, with the default value being false -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] RyanBerti commented on a diff in pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch
RyanBerti commented on code in PR #40615: URL: https://github.com/apache/spark/pull/40615#discussion_r1175544253 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/DatasketchesHllSketchSuite.scala: ## @@ -0,0 +1,111 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.aggregate + +import scala.collection.immutable.NumericRange +import scala.util.Random + +import org.apache.datasketches.hll.HllSketch +import org.apache.datasketches.memory.Memory + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{BoundReference, HllSketchEstimate} +import org.apache.spark.sql.types.{BinaryType, DataType, IntegerType, LongType, StringType} +import org.apache.spark.unsafe.types.UTF8String + + +class DatasketchesHllSketchSuite extends SparkFunSuite { Review Comment: Ok, I'll go ahead and remove tgtHllType as a parameter. I was of the mindset that the Spark API should match the Datasketches HllSketch API, but we can slim down the params supported by Spark for ease of use. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] aokolnychyi commented on pull request #40919: [SPARK-43204][SQL] Align MERGE assignments with table attributes
aokolnychyi commented on PR #40919: URL: https://github.com/apache/spark/pull/40919#issuecomment-1520498405 @cloud-fan @sunchao @viirya @huaxingao @dongjoon-hyun @gengliangwang, this is a follow-up to PR #40308. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] RyanBerti commented on a diff in pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch
RyanBerti commented on code in PR #40615: URL: https://github.com/apache/spark/pull/40615#discussion_r1175541542 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala: ## @@ -0,0 +1,336 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.aggregate + + +import java.util.Locale + +import org.apache.datasketches.SketchesArgumentException +import org.apache.datasketches.hll.{HllSketch, TgtHllType, Union} +import org.apache.datasketches.memory.Memory + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, ExpressionDescription, Literal} +import org.apache.spark.sql.catalyst.trees.{BinaryLike, TernaryLike} +import org.apache.spark.sql.types.{AbstractDataType, AnyDataType, BinaryType, DataType, IntegerType, LongType, StringType} +import org.apache.spark.unsafe.types.UTF8String + + +/** + * The HllSketchAgg function utilizes a Datasketches HllSketch instance to + * probabilistically count the number of unique values in a given column, and + * outputs the binary representation of the HllSketch. + * + * See [[https://datasketches.apache.org/docs/HLL/HLL.html]] for more information + * + * @param child child expression against which unique counting will occur + * @param lgConfigK the log-base-2 of K, where K is the number of buckets or slots for the sketch + * @param tgtHllType the target type of the HllSketch to be used (HLL_4, HLL_6, HLL_8) + */ +@ExpressionDescription( + usage = """ +_FUNC_(expr, lgConfigK, tgtHllType) - Returns the HllSketch's compact binary representation. + `lgConfigK` (optional) the log-base-2 of K, with K = the number of buckets for the HllSketch. + `tgtHllType` (optional) the target type of the HllSketch (HLL_4, HLL_6, HLL_8). """, + examples = """ +Examples: + > SELECT hll_sketch_estimate(_FUNC_(col1, 12, 'HLL_4')) + FROM VALUES (1), (1), (2), (2), (3) tab(col1); + 3 + """, + group = "agg_funcs", + since = "3.5.0") +case class HllSketchAgg( +child: Expression, +lgConfigKExpression: Expression, +tgtHllTypeExpression: Expression, +mutableAggBufferOffset: Int = 0, +inputAggBufferOffset: Int = 0) + extends TypedImperativeAggregate[HllSketch] with TernaryLike[Expression] with ExpectsInputTypes { + + // Hllsketch config - mark as lazy so that they're not evaluated during tree transformation. + + lazy val lgConfigK: Int = second.eval().asInstanceOf[Int] + lazy val tgtHllType: TgtHllType = try { + TgtHllType.valueOf(third.eval().asInstanceOf[UTF8String].toString.toUpperCase(Locale.ROOT)) + } catch { +case _: IllegalArgumentException => + throw new SketchesArgumentException("Invalid tgtHllType value") + } + + // Constructors + + def this(child: Expression) = { +this(child, Literal(HllSketch.DEFAULT_LG_K), Literal(HllSketch.DEFAULT_HLL_TYPE.toString), 0, 0) + } + + def this(child: Expression, lgConfigK: Expression) = { +this(child, lgConfigK, Literal(HllSketch.DEFAULT_HLL_TYPE.toString), 0, 0) + } + + def this(child: Expression, lgConfigK: Int) = { +this(child, Literal(lgConfigK), Literal(HllSketch.DEFAULT_HLL_TYPE.toString), 0, 0) + } + + def this(child: Expression, lgConfigK: Expression, tgtHllType: Expression) = { +this(child, lgConfigK, tgtHllType, 0, 0) + } + + def this(child: Expression, lgConfigK: Int, tgtHllType: String) = { +this(child, Literal(lgConfigK), Literal(tgtHllType), 0, 0) + } + + // Copy constructors required by ImperativeAggregate + + override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: Int): HllSketchAgg = +copy(mutableAggBufferOffset = newMutableAggBufferOffset) + + override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): HllSketchAgg = +copy(inputAggBufferOffset = newInputAggBufferOffset) + + override protected def withNewChildrenInternal(newFirst: Expression, + newSecond: Expression, + newThird: Expression): HllSketchAgg = +copy(child = newFirst,
[GitHub] [spark] amaliujia commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module
amaliujia commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1175520012 ## common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala: ## @@ -34,7 +33,7 @@ private[spark] object ErrorMessageFormat extends Enumeration { */ private[spark] object SparkThrowableHelper { val errorReader = new ErrorClassesJsonReader( -Seq(Utils.getSparkClassLoader.getResource("error/error-classes.json"))) +Seq(getClass.getClassLoader.getResource("error/error-classes.json"))) Review Comment: Yes Utils is a bit big and I need to tackle that separately with a strategy -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats
pan3793 commented on PR #40920: URL: https://github.com/apache/spark/pull/40920#issuecomment-1520446235 cc @sunchao -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module
hvanhovell commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1175497831 ## common/utils/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala: ## @@ -30,6 +29,7 @@ import org.apache.commons.text.StringSubstitutor import org.apache.spark.annotation.DeveloperApi + Review Comment: Remove line -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module
hvanhovell commented on code in PR #40931: URL: https://github.com/apache/spark/pull/40931#discussion_r1175496791 ## common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala: ## @@ -34,7 +33,7 @@ private[spark] object ErrorMessageFormat extends Enumeration { */ private[spark] object SparkThrowableHelper { val errorReader = new ErrorClassesJsonReader( -Seq(Utils.getSparkClassLoader.getResource("error/error-classes.json"))) +Seq(getClass.getClassLoader.getResource("error/error-classes.json"))) Review Comment: We should move part of the `Utils` as well. We should do that in a follow-up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth commented on pull request #37630: [SPARK-40193][SQL] Merge subquery plans with different filters
peter-toth commented on PR #37630: URL: https://github.com/apache/spark/pull/37630#issuecomment-1520435320 I extracted the first commit of this PR, that just moves `MergeScalarSubqueries` from `spark-catalyst` to `spark-sql` to https://github.com/apache/spark/pull/40932 to make the actual change of this PR clearer once https://github.com/apache/spark/pull/40932 is merged. https://github.com/apache/spark/pull/40932 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] peter-toth opened a new pull request, #40932: [SPARK-43266][SQL] Move MergeScalarSubqueries to spark-sql
peter-toth opened a new pull request, #40932: URL: https://github.com/apache/spark/pull/40932 ### What changes were proposed in this pull request? This PR moves `MergeScalarSubqueries` from `catalyst` to `spark-sql` ### Why are the changes needed? Make SPARK-40193 / https://github.com/apache/spark/pull/37630 easier. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #40879: [SPARK-43217] Correctly recurse in nested maps/arrays in findNestedField
cloud-fan closed pull request #40879: [SPARK-43217] Correctly recurse in nested maps/arrays in findNestedField URL: https://github.com/apache/spark/pull/40879 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #40879: [SPARK-43217] Correctly recurse in nested maps/arrays in findNestedField
cloud-fan commented on PR #40879: URL: https://github.com/apache/spark/pull/40879#issuecomment-1520408175 thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #40931: [SPARK-43265] Move Error framework to a common utils module
amaliujia commented on PR #40931: URL: https://github.com/apache/spark/pull/40931#issuecomment-1520403075 @cloud-fan @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #40931: [SPARK-43265] Move Error framework to a common utils module
amaliujia opened a new pull request, #40931: URL: https://github.com/apache/spark/pull/40931 ### What changes were proposed in this pull request? Move Error framework to a common utils module so that we can share it between Spark and Spark Connect without introducing heavy dependencies on Spark Connect module. ### Why are the changes needed? Reduce Dependencies on Spark Connect. ### Does this PR introduce _any_ user-facing change? Error framework is internally API so this should be fine. ### How was this patch tested? Existing UT. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata
ryan-johnson-databricks commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175452917 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technically, a file format could choose suppress them, but that is not recommended. */ def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS + + /** + * The extractors to use when deriving file-constant metadata columns for this file format. + * + * By default, the value of a file-constant metadata column is obtained by looking up the column's + * name in the file's metadata column value map. However, implementations can override this method + * in order to provide an extractor that has access to the entire [[PartitionedFile]] when + * deriving the column's value. + * + * NOTE: Extractors are lazy, invoked only if the query actually selects their column at runtime. + * + * See also [[FileFormat.getFileConstantMetadataColumnValue]]. + */ + def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] = Review Comment: > Can we add more comments in the new API, saying that it may only handle part of the constant metadata columns? The rest of them should be handled by a custom FileIndex. There was already one comment to that effect in the PR that added custom metadata column support: https://github.com/apache/spark/pull/40677/files#diff-2d7ec6604810e1be58ca7852ec389853c143b684894f2390ffe07f96701d2219R29 And the unit tests also are a pretty clear demonstration. Good point that the comments are buried pretty deep tho -- I'll try to add/expand them in FileFormat.scala as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata
ryan-johnson-databricks commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175449282 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technically, a file format could choose suppress them, but that is not recommended. */ def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS + + /** + * The extractors to use when deriving file-constant metadata columns for this file format. + * + * By default, the value of a file-constant metadata column is obtained by looking up the column's + * name in the file's metadata column value map. However, implementations can override this method + * in order to provide an extractor that has access to the entire [[PartitionedFile]] when + * deriving the column's value. + * + * NOTE: Extractors are lazy, invoked only if the query actually selects their column at runtime. + * + * See also [[FileFormat.getFileConstantMetadataColumnValue]]. + */ + def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] = Review Comment: > Basically, we get the value of a constant metadata column from two maps: one map is filled by FileIndex and contains the value directly. The other map is filled by the FileFormat implementation and contains extractors. Nit: The extractors don't "contain" the value at all. They just tell how to extract a value from the `PartitionedFile` that the `FileIndex` provides. I don't know a cleaner way to do this, given that the extraction policy should probably be consistent across all files, rather than re-defined by each file? (prior to this PR, the "policy" was hard-wired as "look it up in the map"; now it's allowed to also be "do arbitrary computation on `PartitionedFile`, if the user prefers) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats
pan3793 commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175444671 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.length > threshold) { val hadoopConf = spark.sessionState.newHadoopConf() val serializableConfiguration = new SerializableConfiguration(hadoopConf) - val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray + val locations = partitionSpecsAndLocs.map(_._2).toArray Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats
pan3793 commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175443729 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.length > threshold) { val hadoopConf = spark.sessionState.newHadoopConf() val serializableConfiguration = new SerializableConfiguration(hadoopConf) - val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray + val locations = partitionSpecsAndLocs.map(_._2).toArray // Set the number of parallelism to prevent following file listing from generating many tasks // in case of large #defaultParallelism. - val numParallelism = Math.min(serializedPaths.length, + val numParallelism = Math.min(locations.length, Review Comment: Yes, changed to evaluate once as `partitionNum` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats
pan3793 commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175442719 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.length > threshold) { val hadoopConf = spark.sessionState.newHadoopConf() val serializableConfiguration = new SerializableConfiguration(hadoopConf) - val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray + val locations = partitionSpecsAndLocs.map(_._2) // Set the number of parallelism to prevent following file listing from generating many tasks // in case of large #defaultParallelism. - val numParallelism = Math.min(serializedPaths.length, + val numParallelism = Math.min(locations.length, Math.min(spark.sparkContext.defaultParallelism, 1)) // gather the fast stats for all the partitions otherwise Hive metastore will list all the // files for all the new partitions in sequential way, which is super slow. logInfo(s"Gather the fast stats in parallel using $numParallelism tasks.") - spark.sparkContext.parallelize(serializedPaths, numParallelism) -.mapPartitions { paths => + spark.sparkContext.parallelize(locations, numParallelism) +.mapPartitions { locationsEachPartition => val pathFilter = getPathFilter(serializableConfiguration.value) - paths.map(new Path(_)).map{ path => -val fs = path.getFileSystem(serializableConfiguration.value) -val statuses = fs.listStatus(path, pathFilter) -(path.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum)) + locationsEachPartition.map { location => +val fs = location.getFileSystem(serializableConfiguration.value) +val statuses = fs.listStatus(location, pathFilter) +(location.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum)) Review Comment: Looks simpler, adopted. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] majdyz commented on pull request #40929: [SPARK-43264][SQL] Avoid allocation of unwritten ColumnVector in Spark Vectorized Reader
majdyz commented on PR #40929: URL: https://github.com/apache/spark/pull/40929#issuecomment-1520371824 @LuciferYang Thanks, I think it's already been enabled now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] majdyz closed pull request #40929: [SPARK-43264][SQL] Avoid allocation of unwritten ColumnVector in Spark Vectorized Reader
majdyz closed pull request #40929: [SPARK-43264][SQL] Avoid allocation of unwritten ColumnVector in Spark Vectorized Reader URL: https://github.com/apache/spark/pull/40929 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata
ryan-johnson-databricks commented on code in PR #40885: URL: https://github.com/apache/spark/pull/40885#discussion_r1175437133 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala: ## @@ -203,6 +203,21 @@ trait FileFormat { * method. Technically, a file format could choose suppress them, but that is not recommended. */ def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS + + /** + * The extractors to use when deriving file-constant metadata columns for this file format. + * + * By default, the value of a file-constant metadata column is obtained by looking up the column's + * name in the file's metadata column value map. However, implementations can override this method + * in order to provide an extractor that has access to the entire [[PartitionedFile]] when + * deriving the column's value. + * + * NOTE: Extractors are lazy, invoked only if the query actually selects their column at runtime. + * + * See also [[FileFormat.getFileConstantMetadataColumnValue]]. + */ + def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] = Review Comment: I made a dummy PR https://github.com/apache/spark/pull/40930 to show what the other approach actually looks like. I still think it is uglier code for everyone involved, with no benefit to safety, but at least this way we have something concrete to compare. The change for the other PR vs master is: ``` sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala | 33 +--- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala | 145 -- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala | 70 +++ sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala | 10 - sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala | 4 ++-- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCustomMetadataStructSuite.scala | 71 6 files changed, 196 insertions(+), 137 deletions(-) ``` The change of this PR vs. master touches fewer files and is ~50LoC smaller: ``` sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala | 33 +++-- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala | 102 -- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala | 70 +- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala | 10 +- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCustomMetadataStructSuite.scala | 35 ++- 5 files changed, 143 insertions(+), 107 deletions(-) ``` The other PR vs. this one is changes 3 files and ~100LoC: ``` sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala | 99 --- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala | 4 ++-- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCustomMetadataStructSuite.scala | 70 -- 3 files changed, 98 insertions(+), 75 deletions(-) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #40847: [SPARK-43185][BUILD] Inline `hadoop-client` related properties in `pom.xml`
LuciferYang commented on PR #40847: URL: https://github.com/apache/spark/pull/40847#issuecomment-1520363712 More 1. The conclusion using hadoop 3.0.x and hadoop 3.1.x is the same 2. User hadoop 3.2.x can't build `hadoop-cloud` module too 3. Currently, only hadoop 3.3. x can build all modules -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ryan-johnson-databricks opened a new pull request, #40930: [DO NOT MERGE] File constant metadata extractors split
ryan-johnson-databricks opened a new pull request, #40930: URL: https://github.com/apache/spark/pull/40930 ### What changes were proposed in this pull request? Experimental PR in response to https://github.com/apache/spark/pull/40885#discussion_r1174277575, so that reviewers can visualize the difference between the two approaches discussed there. NOT FOR MERGE -- If this approach is chosen, I will update the other PR accordingly. Only the last commit differs from the other PR. ### Why are the changes needed? N/A ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats
LuciferYang commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175425968 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.length > threshold) { val hadoopConf = spark.sessionState.newHadoopConf() val serializableConfiguration = new SerializableConfiguration(hadoopConf) - val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray + val locations = partitionSpecsAndLocs.map(_._2) // Set the number of parallelism to prevent following file listing from generating many tasks // in case of large #defaultParallelism. - val numParallelism = Math.min(serializedPaths.length, + val numParallelism = Math.min(locations.length, Math.min(spark.sparkContext.defaultParallelism, 1)) // gather the fast stats for all the partitions otherwise Hive metastore will list all the // files for all the new partitions in sequential way, which is super slow. logInfo(s"Gather the fast stats in parallel using $numParallelism tasks.") - spark.sparkContext.parallelize(serializedPaths, numParallelism) -.mapPartitions { paths => + spark.sparkContext.parallelize(locations, numParallelism) +.mapPartitions { locationsEachPartition => val pathFilter = getPathFilter(serializableConfiguration.value) - paths.map(new Path(_)).map{ path => -val fs = path.getFileSystem(serializableConfiguration.value) -val statuses = fs.listStatus(path, pathFilter) -(path.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum)) + locationsEachPartition.map { location => +val fs = location.getFileSystem(serializableConfiguration.value) +val statuses = fs.listStatus(location, pathFilter) +(location.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum)) Review Comment: How about change the return type to `Map[Path, PartitionStatistics]`, not sure if it would be better -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats
LuciferYang commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175422370 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.length > threshold) { val hadoopConf = spark.sessionState.newHadoopConf() val serializableConfiguration = new SerializableConfiguration(hadoopConf) - val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray + val locations = partitionSpecsAndLocs.map(_._2).toArray // Set the number of parallelism to prevent following file listing from generating many tasks // in case of large #defaultParallelism. - val numParallelism = Math.min(serializedPaths.length, + val numParallelism = Math.min(locations.length, Review Comment: Are `locations.length` and `partitionSpecsAndLocs.length` equal? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats
LuciferYang commented on code in PR #40920: URL: https://github.com/apache/spark/pull/40920#discussion_r1175420930 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -789,22 +789,22 @@ case class RepairTableCommand( if (partitionSpecsAndLocs.length > threshold) { val hadoopConf = spark.sessionState.newHadoopConf() val serializableConfiguration = new SerializableConfiguration(hadoopConf) - val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray + val locations = partitionSpecsAndLocs.map(_._2).toArray Review Comment: Can we remove `toArray`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org