[GitHub] [spark] khalidmammadov commented on a diff in pull request #40916: [SPARK-43243][PYTHON][CONNECT] Add level param to printSchema for Python

2023-04-24 Thread via GitHub


khalidmammadov commented on code in PR #40916:
URL: https://github.com/apache/spark/pull/40916#discussion_r1176026841


##
python/pyspark/sql/dataframe.py:
##
@@ -584,14 +584,14 @@ def printSchema(self, level: Optional[int] = None) -> None:
 .. versionchanged:: 3.4.0
 Supports Spark Connect.
 
-.. versionchanged:: 3.5.0
-Added Level parameter.
-
 Parameters
 ----------
 level : int, optional, default None
 How many levels to print for nested schemas.
 
+.. versionchanged:: 3.5.0

Review Comment:
   Good spot! fixed. Thanks
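
   For context, a minimal sketch of the behavior this PR adds, assuming the
   `level` parameter lands in PySpark 3.5.0 as proposed (the Scala API's
   `Dataset.printSchema(level: Int)` already behaves this way):

   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   # A column `a` holding a struct nested inside another struct.
   df = spark.sql("SELECT named_struct('b', named_struct('c', 1)) AS a")

   df.printSchema()         # prints every nesting level: a, a.b, a.b.c
   df.printSchema(level=1)  # prints only the top level: a
   ```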






[GitHub] [spark] cloud-fan commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


cloud-fan commented on code in PR #40931:
URL: https://github.com/apache/spark/pull/40931#discussion_r1176020421


##
common/utils/src/main/scala/org/apache/spark/JsonProtocol.scala:
##
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import java.io.ByteArrayOutputStream
+import java.nio.charset.StandardCharsets
+
+import com.fasterxml.jackson.core.{JsonEncoding, JsonGenerator}
+import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
+import com.fasterxml.jackson.module.scala.DefaultScalaModule
+
+
+private[spark] object JsonProtocolUtil {

Review Comment:
   the name could be just `JsonUtils`?






[GitHub] [spark] cloud-fan commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


cloud-fan commented on code in PR #40931:
URL: https://github.com/apache/spark/pull/40931#discussion_r1176020179


##
common/utils/src/main/scala/org/apache/spark/JsonProtocol.scala:
##
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import java.io.ByteArrayOutputStream
+import java.nio.charset.StandardCharsets
+
+import com.fasterxml.jackson.core.{JsonEncoding, JsonGenerator}
+import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
+import com.fasterxml.jackson.module.scala.DefaultScalaModule
+
+
+private[spark] object JsonProtocolUtil {

Review Comment:
   should this live in the `org.apache.spark.util` package? and can we call 
this util function in `org.apache.spark.util.JsonProtocol.toJsonString` to 
avoid duplicated code?






[GitHub] [spark] cloud-fan commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


cloud-fan commented on code in PR #40931:
URL: https://github.com/apache/spark/pull/40931#discussion_r1176019274


##
common/utils/src/main/scala/org/apache/spark/SparkException.scala:
##
@@ -18,14 +18,13 @@
 package org.apache.spark
 
 import java.io.FileNotFoundException
+import java.nio.file.FileAlreadyExistsException
 import java.sql.{SQLException, SQLFeatureNotSupportedException}
 import java.time.DateTimeException
 import java.util.ConcurrentModificationException
 
 import scala.collection.JavaConverters._
 
-import org.apache.hadoop.fs.FileAlreadyExistsException

Review Comment:
   is this a breaking change?






[GitHub] [spark] amaliujia commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


amaliujia commented on code in PR #40931:
URL: https://github.com/apache/spark/pull/40931#discussion_r1176017207


##
project/MimaExcludes.scala:
##
@@ -66,6 +66,12 @@ object MimaExcludes {
 ProblemFilters.exclude[Problem]("org.sparkproject.spark_core.protobuf.*"),
 
ProblemFilters.exclude[Problem]("org.apache.spark.status.protobuf.StoreTypes*"),
 
+// SPARK-43265: Move Error framework to a common utils module
+
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.QueryContext"),
+
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException"),
+
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException$"),
+
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkThrowable"),

Review Comment:
   I don't think I renamed the package.
   
   This throws from spark-core? Unless spark-core builds the uber jar and this check runs against the uber jar, it should throw, right?









[GitHub] [spark] cloud-fan commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


cloud-fan commented on code in PR #40931:
URL: https://github.com/apache/spark/pull/40931#discussion_r1176015036


##
project/MimaExcludes.scala:
##
@@ -66,6 +66,12 @@ object MimaExcludes {
 ProblemFilters.exclude[Problem]("org.sparkproject.spark_core.protobuf.*"),
 
ProblemFilters.exclude[Problem]("org.apache.spark.status.protobuf.StoreTypes*"),
 
+// SPARK-43265: Move Error framework to a common utils module
+
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.QueryContext"),
+
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException"),
+
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException$"),
+
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkThrowable"),

Review Comment:
   Did we rename the package? Otherwise it should be still binary compatible.






[GitHub] [spark] amaliujia commented on pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


amaliujia commented on PR #40931:
URL: https://github.com/apache/spark/pull/40931#issuecomment-1521157761

   `continuous-integration/appveyor/pr` seems to be stuck.





[GitHub] [spark] LuciferYang commented on a diff in pull request #40892: [SPARK-43128][CONNECT] Make `recentProgress` and `lastProgress` return `StreamingQueryProgress` consistent with the native Scal

2023-04-24 Thread via GitHub


LuciferYang commented on code in PR #40892:
URL: https://github.com/apache/spark/pull/40892#discussion_r1175980270


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryProgressSuite.scala:
##
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import scala.jdk.CollectionConverters._
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
+import org.apache.spark.sql.connect.client.util.ConnectFunSuite
+import org.apache.spark.sql.types.StructType
+
+class StreamingQueryProgressSuite extends ConnectFunSuite {
+  test("test seder StreamingQueryProgress from json") {
+val jsonStringFromServerSide =
+  s"""
+ |{
+ |  "id" : "33ac26f4-1c39-46ce-b798-f3d2a21211d4",
+ |  "runId" : "849c2c9a-b9f8-446f-9180-259a60fd888c",
+ |  "name" : "myName",
+ |  "timestamp" : "2016-12-05T20:54:20.827Z",
+ |  "batchId" : 2,
+ |  "batchDuration" : 0,
+ |  "durationMs" : {
+ |"total" : 0
+ |  },
+ |  "eventTime" : {
+ |"min" : "2016-12-05T20:54:20.827Z",
+ |"avg" : "2016-12-05T20:54:20.827Z",
+ |"watermark" : "2016-12-05T20:54:20.827Z",
+ |"max" : "2016-12-05T20:54:20.827Z"
+ |  },
+ |  "stateOperators" : [ {
+ |"operatorName" : "op1",
+ |"numRowsTotal" : 0,
+ |"numRowsUpdated" : 1,
+ |"allUpdatesTimeMs" : 1,
+ |"numRowsRemoved" : 2,
+ |"allRemovalsTimeMs" : 34,
+ |"commitTimeMs" : 23,
+ |"memoryUsedBytes" : 3,
+ |"numRowsDroppedByWatermark" : 0,
+ |"numShufflePartitions" : 2,
+ |"numStateStoreInstances" : 2,
+ |"customMetrics" : {
+ |  "stateOnCurrentVersionSizeBytes" : 2,
+ |  "loadedMapCacheHitCount" : 1,
+ |  "loadedMapCacheMissCount" : 0
+ |}
+ |  } ],
+ |  "sources" : [ {
+ |"description" : "source",
+ |"startOffset" : "123",
+ |"endOffset" : "456",
+ |"latestOffset" : "789",
+ |"numInputRows" : 678,
+ |"inputRowsPerSecond" : 10.0,
+ |"processedRowsPerSecond" : "Infinity",
+ |"metrics" : { }
+ |  } ],
+ |  "sink" : {
+ |"description" : "sink",
+ |"numOutputRows" : -1,
+ |"metrics" : { }
+ |  },
+ |  "observedMetrics" : {
+ |"event1" : {
+ |  "values" : [ 1, 3.0 ],
+ |  "schema" : {
+ |"type" : "struct",
+ |"fields" : [ {
+ |  "name" : "c1",
+ |  "type" : "long",
+ |  "nullable" : true,
+ |  "metadata" : { }
+ |}, {
+ |  "name" : "c2",
+ |  "type" : "double",
+ |  "nullable" : true,
+ |  "metadata" : { }
+ |} ]
+ |  }
+ |},
+ |"event2" : {
+ |  "values" : [ 1, "hello", "world" ],
+ |  "schema" : {
+ |"type" : "struct",
+ |"fields" : [ {
+ |  "name" : "rc",
+ |  "type" : "long",
+ |  "nullable" : true,
+ |  "metadata" : { }
+ |}, {
+ |  "name" : "min_q",
+ |  "type" : "string",
+ |  "nullable" : true,
+ |  "metadata" : { }
+ |}, {
+ |  "name" : "max_q",
+ |  "type" : "string",
+ |  "nullable" : true,
+ |  "metadata" : { }
+ |} ]
+ |  }
+ |}
+ |  }
+ |}
+  """.stripMargin.trim
+
+// scalastyle:off
+println(jsonStringFromServerSide)
+
+val result = StreamingQueryProgress.fromJson(jsonStringFromServerSide)
+assert(result.id.toString === "33ac26f4-1c39-46ce-b798-f3d2a21211d4")

[GitHub] [spark] zhengruifeng commented on pull request #40927: [SPARK-42419][FOLLOWUP][CONNECT][PYTHON] Remove unused exception

2023-04-24 Thread via GitHub


zhengruifeng commented on PR #40927:
URL: https://github.com/apache/spark/pull/40927#issuecomment-152978

   oh, the linter fails





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40617: [SPARK-42992][PYTHON] Introduce PySparkRuntimeError

2023-04-24 Thread via GitHub


zhengruifeng commented on code in PR #40617:
URL: https://github.com/apache/spark/pull/40617#discussion_r1175974399


##
python/pyspark/errors/error_classes.py:
##
@@ -79,11 +114,26 @@
   "returnType can not be specified when `` is a user-defined 
function, but got ."
 ]
   },
+  "CANNOT_UNPERSIST_BROADCAST": {
+"message": [
+  "Broadcast can only be unpersisted in driver."
+]
+  },
   "COLUMN_IN_LIST": {
 "message": [
   "`` does not allow a Column in a list."
 ]
   },
+  "CONTEXT_ONLY_VALID_ON_DRIVER" : {
+"message" : [
+  "It appears that you are attempting to reference SparkContext from a 
broadcast variable, action, or transformation. SparkContext can only be used on 
the driver, not in code that it run on workers. For more information, see 
SPARK-5063."
+]
+  },
+  "CONTEXT_UNAVAILABLE_FOR_REMOTE_CLIENT" : {
+"message" : [
+  "Remote client cannot create a SparkContext. Create SparkSession 
instead."

Review Comment:
   NVM, I feel it reads as `Remote client cannot create a SparkContext, so the Spark Connect Python client creates a session instead`, while it actually raises an error and fails.
   Not a big deal.






[GitHub] [spark] zhengruifeng commented on pull request #40936: [MINOR][CONNECT] Remove unnecessary creation of `planner` in `handleWriteOperation` and `handleWriteOperationV2`

2023-04-24 Thread via GitHub


zhengruifeng commented on PR #40936:
URL: https://github.com/apache/spark/pull/40936#issuecomment-1521106087

   thanks @LuciferYang for review, merged into master





[GitHub] [spark] zhengruifeng closed pull request #40936: [MINOR][CONNECT] Remove unnecessary creation of `planner` in `handleWriteOperation` and `handleWriteOperationV2`

2023-04-24 Thread via GitHub


zhengruifeng closed pull request #40936: [MINOR][CONNECT] Remove unnecessary 
creation of `planner` in `handleWriteOperation` and `handleWriteOperationV2`
URL: https://github.com/apache/spark/pull/40936





[GitHub] [spark] bogao007 commented on a diff in pull request #40892: [SPARK-43128][CONNECT] Make `recentProgress` and `lastProgress` return `StreamingQueryProgress` consistent with the native Scala A

2023-04-24 Thread via GitHub


bogao007 commented on code in PR #40892:
URL: https://github.com/apache/spark/pull/40892#discussion_r1175960363


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryProgressSuite.scala:
##
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import scala.jdk.CollectionConverters._
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
+import org.apache.spark.sql.connect.client.util.ConnectFunSuite
+import org.apache.spark.sql.types.StructType
+
+class StreamingQueryProgressSuite extends ConnectFunSuite {
+  test("test seder StreamingQueryProgress from json") {
+val jsonStringFromServerSide =
+  s"""
+ |{
+ |  "id" : "33ac26f4-1c39-46ce-b798-f3d2a21211d4",
+ |  "runId" : "849c2c9a-b9f8-446f-9180-259a60fd888c",
+ |  "name" : "myName",
+ |  "timestamp" : "2016-12-05T20:54:20.827Z",
+ |  "batchId" : 2,
+ |  "batchDuration" : 0,
+ |  "durationMs" : {
+ |"total" : 0
+ |  },
+ |  "eventTime" : {
+ |"min" : "2016-12-05T20:54:20.827Z",
+ |"avg" : "2016-12-05T20:54:20.827Z",
+ |"watermark" : "2016-12-05T20:54:20.827Z",
+ |"max" : "2016-12-05T20:54:20.827Z"
+ |  },
+ |  "stateOperators" : [ {
+ |"operatorName" : "op1",
+ |"numRowsTotal" : 0,
+ |"numRowsUpdated" : 1,
+ |"allUpdatesTimeMs" : 1,
+ |"numRowsRemoved" : 2,
+ |"allRemovalsTimeMs" : 34,
+ |"commitTimeMs" : 23,
+ |"memoryUsedBytes" : 3,
+ |"numRowsDroppedByWatermark" : 0,
+ |"numShufflePartitions" : 2,
+ |"numStateStoreInstances" : 2,
+ |"customMetrics" : {
+ |  "stateOnCurrentVersionSizeBytes" : 2,
+ |  "loadedMapCacheHitCount" : 1,
+ |  "loadedMapCacheMissCount" : 0
+ |}
+ |  } ],
+ |  "sources" : [ {
+ |"description" : "source",
+ |"startOffset" : "123",
+ |"endOffset" : "456",
+ |"latestOffset" : "789",
+ |"numInputRows" : 678,
+ |"inputRowsPerSecond" : 10.0,
+ |"processedRowsPerSecond" : "Infinity",
+ |"metrics" : { }
+ |  } ],
+ |  "sink" : {
+ |"description" : "sink",
+ |"numOutputRows" : -1,
+ |"metrics" : { }
+ |  },
+ |  "observedMetrics" : {
+ |"event1" : {
+ |  "values" : [ 1, 3.0 ],
+ |  "schema" : {
+ |"type" : "struct",
+ |"fields" : [ {
+ |  "name" : "c1",
+ |  "type" : "long",
+ |  "nullable" : true,
+ |  "metadata" : { }
+ |}, {
+ |  "name" : "c2",
+ |  "type" : "double",
+ |  "nullable" : true,
+ |  "metadata" : { }
+ |} ]
+ |  }
+ |},
+ |"event2" : {
+ |  "values" : [ 1, "hello", "world" ],
+ |  "schema" : {
+ |"type" : "struct",
+ |"fields" : [ {
+ |  "name" : "rc",
+ |  "type" : "long",
+ |  "nullable" : true,
+ |  "metadata" : { }
+ |}, {
+ |  "name" : "min_q",
+ |  "type" : "string",
+ |  "nullable" : true,
+ |  "metadata" : { }
+ |}, {
+ |  "name" : "max_q",
+ |  "type" : "string",
+ |  "nullable" : true,
+ |  "metadata" : { }
+ |} ]
+ |  }
+ |}
+ |  }
+ |}
+  """.stripMargin.trim
+
+// scalastyle:off
+println(jsonStringFromServerSide)
+
+val result = StreamingQueryProgress.fromJson(jsonStringFromServerSide)
+assert(result.id.toString === "33ac26f4-1c39-46ce-b798-f3d2a21211d4")
+  

[GitHub] [spark] itholic commented on pull request #40617: [SPARK-42992][PYTHON] Introduce PySparkRuntimeError

2023-04-24 Thread via GitHub


itholic commented on PR #40617:
URL: https://github.com/apache/spark/pull/40617#issuecomment-1521086780

   Applied the comments, thanks @zhengruifeng !





[GitHub] [spark] itholic commented on a diff in pull request #40617: [SPARK-42992][PYTHON] Introduce PySparkRuntimeError

2023-04-24 Thread via GitHub


itholic commented on code in PR #40617:
URL: https://github.com/apache/spark/pull/40617#discussion_r1175958890


##
python/pyspark/errors/error_classes.py:
##
@@ -79,11 +114,26 @@
   "returnType can not be specified when `` is a user-defined 
function, but got ."
 ]
   },
+  "CANNOT_UNPERSIST_BROADCAST": {
+"message": [
+  "Broadcast can only be unpersisted in driver."
+]
+  },
   "COLUMN_IN_LIST": {
 "message": [
   "`` does not allow a Column in a list."
 ]
   },
+  "CONTEXT_ONLY_VALID_ON_DRIVER" : {
+"message" : [
+  "It appears that you are attempting to reference SparkContext from a 
broadcast variable, action, or transformation. SparkContext can only be used on 
the driver, not in code that it run on workers. For more information, see 
SPARK-5063."
+]
+  },
+  "CONTEXT_UNAVAILABLE_FOR_REMOTE_CLIENT" : {
+"message" : [
+  "Remote client cannot create a SparkContext. Create SparkSession 
instead."

Review Comment:
   Sorry, I didn't catch the point. The message suggests using a Spark session because it is not possible to create a SparkContext in a remote session. Do you have a proposed alternative?






[GitHub] [spark] LuciferYang commented on pull request #40430: [SPARK-42798][BUILD] Upgrade protobuf-java to 3.22.3

2023-04-24 Thread via GitHub


LuciferYang commented on PR #40430:
URL: https://github.com/apache/spark/pull/40430#issuecomment-1521078012

   All tests passed.





[GitHub] [spark] itholic commented on a diff in pull request #40617: [SPARK-42992][PYTHON] Introduce PySparkRuntimeError

2023-04-24 Thread via GitHub


itholic commented on code in PR #40617:
URL: https://github.com/apache/spark/pull/40617#discussion_r1175953882


##
python/pyspark/errors/error_classes.py:
##
@@ -79,11 +114,26 @@
   "returnType can not be specified when `` is a user-defined 
function, but got ."
 ]
   },
+  "CANNOT_UNPERSIST_BROADCAST": {

Review Comment:
   Sounds good. Seems like we can consolidate `CANNOT_DESTROY_BROADCAST` as 
well :-)






[GitHub] [spark] LuciferYang commented on pull request #40738: [SPARK-42380][BUILD] Upgrade Apache Maven to 3.9.1

2023-04-24 Thread via GitHub


LuciferYang commented on PR #40738:
URL: https://github.com/apache/spark/pull/40738#issuecomment-1521073578

   @steveloughran Do you have any other questions? I am planning to close this 
PR and test Apache Maven 3.9.2+ in the future to check if it is possible not to 
add `-Dmaven.resolver.transport=wagon`. Apache Spark can continue to use 3.8.7 
now.
   
   
   
   





[GitHub] [spark] pan3793 commented on pull request #40893: [SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution

2023-04-24 Thread via GitHub


pan3793 commented on PR #40893:
URL: https://github.com/apache/spark/pull/40893#issuecomment-1521073105

   > OK, am I right that this does not make Spark any _less_ compatible with 
any version of Hive that is currently supported (>= 2.3.9)? If so then this is 
fine
   
   Yes.





[GitHub] [spark] melin commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

2023-04-24 Thread via GitHub


melin commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1521072841

   > @melin I have opened another pr #39114 for this, and unfortunately that 
one was closed because of no feedback for a long time. If necessary, I can 
consider reopening it.
   
   Please reopen it, thank you.
   





[GitHub] [spark] LuciferYang commented on pull request #40847: [SPARK-43185][BUILD] Inline `hadoop-client` related properties in `pom.xml`

2023-04-24 Thread via GitHub


LuciferYang commented on PR #40847:
URL: https://github.com/apache/spark/pull/40847#issuecomment-1521071604

   > Interesting, thanks for the detailed analysis @LuciferYang !
   > 
   > > Use hadoop 3.2.x can't build hadoop-cloud module too
   > 
   > This is Hadoop 3.2.2 ? I remember at some point we started to enable 
`hadoop-cloud` in Spark release, so I wonder why this didn't cause any error 
back in the time ..
   
   I tested Hadoop 3.2.4. `AbortableStreamBasedCheckpointFileManager` was introduced in SPARK-40039, and it uses APIs that are only available in Hadoop 3.3.1+ ([HADOOP-16906](https://issues.apache.org/jira/browse/HADOOP-16906): `FSDataOutputStream#abort()`).
   
   





[GitHub] [spark] itholic commented on pull request #40927: [SPARK-42419][FOLLOWUP][CONNECT][PYTHON] Remove unused exception

2023-04-24 Thread via GitHub


itholic commented on PR #40927:
URL: https://github.com/apache/spark/pull/40927#issuecomment-1521055105

   > what about changing to?
   
   @zhengruifeng Nice! Just applied the comment





[GitHub] [spark] itholic commented on pull request #40722: [SPARK-43076][PS][CONNECT] Removing the dependency on `grpcio` when remote session is not used.

2023-04-24 Thread via GitHub


itholic commented on PR #40722:
URL: https://github.com/apache/spark/pull/40722#issuecomment-1521051461

   @HyukjinKwon Tests passed. Is there anything else we should address here?





[GitHub] [spark] cloud-fan commented on pull request #40922: [SPARK-43063][SQL][FOLLOWUP] Add ToPrettyString expression for Dataset.show

2023-04-24 Thread via GitHub


cloud-fan commented on PR #40922:
URL: https://github.com/apache/spark/pull/40922#issuecomment-1521043238

   > would it be possible to not make any changes in Cast and do everything in 
df.show method?
   
   We can by duplicating the code of `Cast`, but I don't think that's a good 
idea.





[GitHub] [spark] jackylee-ch commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

2023-04-24 Thread via GitHub


jackylee-ch commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1521041727

   @melin I have opened another pr #39114 for this, and unfortunately that one 
was closed because of no feedback for a long time. If necessary, I can consider 
reopening it.





[GitHub] [spark] WweiL commented on pull request #40906: [SPARK-43134] [CONNECT] [SS] JVM client StreamingQuery exception() API

2023-04-24 Thread via GitHub


WweiL commented on PR #40906:
URL: https://github.com/apache/spark/pull/40906#issuecomment-1521041159

   > CheckConnectJvmClientCompatibility
   
   Ah thanks! Removed it!





[GitHub] [spark] zhengruifeng opened a new pull request, #40936: [MINOR][CONNECT] Remove unnecessary creation of `planner` in `handleWriteOperation` and `handleWriteOperationV2`

2023-04-24 Thread via GitHub


zhengruifeng opened a new pull request, #40936:
URL: https://github.com/apache/spark/pull/40936

   ### What changes were proposed in this pull request?
   Remove unnecessary creation of `planner` in `handleWriteOperation` and 
`handleWriteOperationV2`
   
   
   ### Why are the changes needed?
   `handleWriteOperation` and `handleWriteOperationV2` are themselves methods of the planner, so there is no need to create another planner to call `transformRelation`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   existing UTs
   





[GitHub] [spark] zhengruifeng commented on pull request #40928: [SPARK-43262][CONNECT][SS][PYTHON] Migrate Spark Connect Structured Streaming errors into error class

2023-04-24 Thread via GitHub


zhengruifeng commented on PR #40928:
URL: https://github.com/apache/spark/pull/40928#issuecomment-1521036530

   merged to master





[GitHub] [spark] zhengruifeng closed pull request #40928: [SPARK-43262][CONNECT][SS][PYTHON] Migrate Spark Connect Structured Streaming errors into error class

2023-04-24 Thread via GitHub


zhengruifeng closed pull request #40928: [SPARK-43262][CONNECT][SS][PYTHON] 
Migrate Spark Connect Structured Streaming errors into error class
URL: https://github.com/apache/spark/pull/40928





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40617: [SPARK-42992][PYTHON] Introduce PySparkRuntimeError

2023-04-24 Thread via GitHub


zhengruifeng commented on code in PR #40617:
URL: https://github.com/apache/spark/pull/40617#discussion_r1175926004


##
python/pyspark/errors/error_classes.py:
##
@@ -79,11 +114,26 @@
   "returnType can not be specified when `` is a user-defined 
function, but got ."
 ]
   },
+  "CANNOT_UNPERSIST_BROADCAST": {

Review Comment:
   Is it possible to combine `CANNOT_UNPERSIST_BROADCAST`, `CANNOT_REDUCE_BROADCAST`, and `CANNOT_DESTROY_BROADCAST` into a single `INVALID_BROADCAST_OPERATION` with a parameter `operation`?
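
   A hypothetical sketch of that consolidation; the `INVALID_BROADCAST_OPERATION` name and the `operation` parameter are the suggestion above, not the merged API:

   ```python
   from pyspark.errors import PySparkRuntimeError

   # Sketch of the consolidated error_classes.py entry:
   #   "INVALID_BROADCAST_OPERATION": {
   #     "message": ["Broadcast can only be <operation> in driver."]
   #   },

   # One raise site then covers unpersist/destroy/reduce:
   raise PySparkRuntimeError(
       error_class="INVALID_BROADCAST_OPERATION",
       message_parameters={"operation": "unpersisted"},
   )
   ```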



##
python/pyspark/errors/error_classes.py:
##
@@ -29,6 +34,21 @@
   "Attribute `` in provided object `` is not 
callable."
 ]
   },
+  "BARRIER_TASK_CONTEXT_NOT_INITIALIZE": {
+"message": [
+  "Not supported to call `` before initialize 
BarrierTaskContext."
+]
+  },
+  "BROADCAST_VARIABLE_NOT_LOADED": {
+"message": [
+  "Broadcast variable `` not loaded."
+]
+  },
+  "CALL_BEFORE_SESSION_INITIALIZE" : {

Review Comment:
   What about combining `BARRIER_TASK_CONTEXT_NOT_INITIALIZE` and `CALL_BEFORE_SESSION_INITIALIZE` into a general `CALL_BEFORE_INITIALIZE` with an extra parameter `object`?
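
   Sketched the same way (the `CALL_BEFORE_INITIALIZE` name and the `object` parameter are the proposal here, not merged code):

   ```python
   # A sketch of the combined error_classes.py entry; the `<func_name>` and
   # `<object>` placeholders are illustrative.
   ERROR_CLASSES = {
       "CALL_BEFORE_INITIALIZE": {
           "message": [
               "Not supported to call `<func_name>` before initializing <object>."
           ]
       }
   }
   ```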



##
python/pyspark/errors/error_classes.py:
##
@@ -79,11 +114,26 @@
   "returnType can not be specified when `` is a user-defined 
function, but got ."
 ]
   },
+  "CANNOT_UNPERSIST_BROADCAST": {
+"message": [
+  "Broadcast can only be unpersisted in driver."
+]
+  },
   "COLUMN_IN_LIST": {
 "message": [
   "`` does not allow a Column in a list."
 ]
   },
+  "CONTEXT_ONLY_VALID_ON_DRIVER" : {
+"message" : [
+  "It appears that you are attempting to reference SparkContext from a 
broadcast variable, action, or transformation. SparkContext can only be used on 
the driver, not in code that it run on workers. For more information, see 
SPARK-5063."
+]
+  },
+  "CONTEXT_UNAVAILABLE_FOR_REMOTE_CLIENT" : {
+"message" : [
+  "Remote client cannot create a SparkContext. Create SparkSession 
instead."

Review Comment:
   I think the message is misleading; it implies a `SparkSession` was created.






[GitHub] [spark] melin commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

2023-04-24 Thread via GitHub


melin commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1521033489

   > > Support partition statistics?
   > 
   > @melin I'm working on supporting partition statistics updates; it relies on workers returning detailed partition statistics.
   
   Is it done?





[GitHub] [spark] itholic commented on a diff in pull request #40665: [SPARK-42621][PS] Add inclusive parameter for pd.date_range

2023-04-24 Thread via GitHub


itholic commented on code in PR #40665:
URL: https://github.com/apache/spark/pull/40665#discussion_r1175922744


##
python/pyspark/pandas/tests/test_namespace.py:
##
@@ -221,13 +221,13 @@ def test_date_range(self):
 )
 
 self.assert_eq(
-ps.date_range(start="2017-01-01", end="2017-01-04", closed="left"),
-pd.date_range(start="2017-01-01", end="2017-01-04", closed="left"),
+ps.date_range(start="2017-01-01", end="2017-01-04", inclusive="left"),
+pd.date_range(start="2017-01-01", end="2017-01-04", inclusive="left"),
 )
 
 self.assert_eq(
-ps.date_range(start="2017-01-01", end="2017-01-04", closed="right"),
-pd.date_range(start="2017-01-01", end="2017-01-04", closed="right"),
+ps.date_range(start="2017-01-01", end="2017-01-04", inclusive="right"),
+pd.date_range(start="2017-01-01", end="2017-01-04", inclusive="right"),

Review Comment:
   Can we also have tests for "both", "neither", and a negative case?
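
   A sketch of the requested cases in the same test-suite style, assuming pandas-on-Spark mirrors pandas' validation of `inclusive` (pandas >= 1.4):

   ```python
   self.assert_eq(
       ps.date_range(start="2017-01-01", end="2017-01-04", inclusive="both"),
       pd.date_range(start="2017-01-01", end="2017-01-04", inclusive="both"),
   )
   self.assert_eq(
       ps.date_range(start="2017-01-01", end="2017-01-04", inclusive="neither"),
       pd.date_range(start="2017-01-01", end="2017-01-04", inclusive="neither"),
   )
   # Negative case: an unsupported value should raise, as it does in pandas.
   with self.assertRaises(ValueError):
       ps.date_range(start="2017-01-01", end="2017-01-04", inclusive="middle")
   ```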






[GitHub] [spark] zhengruifeng commented on pull request #40927: [SPARK-42419][FOLLOWUP][CONNECT][PYTHON] Remove unused exception

2023-04-24 Thread via GitHub


zhengruifeng commented on PR #40927:
URL: https://github.com/apache/spark/pull/40927#issuecomment-1521023572

   to be more consistent with 
https://github.com/apache/spark/blob/a45affe3c8e7a724aea7dbbc1af08e36001c7540/python/pyspark/sql/column.py#L924-L932
 
   
   what about changing to?
   
   ```
   if isinstance(length, Column):
       length_expr = length._expr
       start_expr = startPos._expr
   elif isinstance(length, int):
       length_expr = LiteralExpression._from_value(length)
       start_expr = LiteralExpression._from_value(startPos)
   else:
       raise PySparkTypeError(
           error_class="NOT_COLUMN_OR_INT",
           message_parameters={"arg_name": "length", "arg_type": type(length).__name__},
       )
   ```





[GitHub] [spark] bogao007 commented on a diff in pull request #40906: [SPARK-43134] [CONNECT] [SS] JVM client StreamingQuery exception() API

2023-04-24 Thread via GitHub


bogao007 commented on code in PR #40906:
URL: https://github.com/apache/spark/pull/40906#discussion_r1175914088


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala:
##
@@ -199,6 +205,26 @@ class RemoteStreamingQuery(
 // scalastyle:on println
   }
 
+  override def exception: Option[StreamingQueryException] = {
+val exception = executeQueryCmd(_.setException(true)).getException
+if (exception.hasExceptionMessage) {
+  // TODO(SPARK-43206): Add more information to StreamingQueryException.
+  Some(
+new StreamingQueryException(
+  "",
+  cause = null,
+  "",
+  "",
+  errorClass = "STREAM_FAILED",

Review Comment:
   I see, thanks for the clarification!






[GitHub] [spark] srowen commented on pull request #40893: [SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution

2023-04-24 Thread via GitHub


srowen commented on PR #40893:
URL: https://github.com/apache/spark/pull/40893#issuecomment-1521012401

   OK, am I right that this does not make Spark any _less_ compatible with any 
version of Hive that is currently supported (>= 2.3.9)? If so then this is fine





[GitHub] [spark] WweiL commented on a diff in pull request #40906: [SPARK-43134] [CONNECT] [SS] JVM client StreamingQuery exception() API

2023-04-24 Thread via GitHub


WweiL commented on code in PR #40906:
URL: https://github.com/apache/spark/pull/40906#discussion_r1175911789


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala:
##
@@ -199,6 +205,26 @@ class RemoteStreamingQuery(
 // scalastyle:on println
   }
 
+  override def exception: Option[StreamingQueryException] = {
+val exception = executeQueryCmd(_.setException(true)).getException
+if (exception.hasExceptionMessage) {
+  // TODO(SPARK-43206): Add more information to StreamingQueryException.
+  Some(
+new StreamingQueryException(
+  "",
+  cause = null,
+  "",
+  "",
+  errorClass = "STREAM_FAILED",

Review Comment:
   I believe it's not yet... The general error-handling framework for the JVM client is not ready, and errors only include the message and stack trace so far. I'm planning to implement our own proto message to also include the offsets, which will be added in the next PR.






[GitHub] [spark] aokolnychyi commented on pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message

2023-04-24 Thread via GitHub


aokolnychyi commented on PR #40934:
URL: https://github.com/apache/spark/pull/40934#issuecomment-1521009283

   Thanks for reviewing, @gengliangwang @sunchao @huaxingao @viirya!





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40916: [SPARK-43243][PYTHON][CONNECT] Add level param to printSchema for Python

2023-04-24 Thread via GitHub


zhengruifeng commented on code in PR #40916:
URL: https://github.com/apache/spark/pull/40916#discussion_r1175909780


##
python/pyspark/sql/dataframe.py:
##
@@ -584,14 +584,14 @@ def printSchema(self, level: Optional[int] = None) -> None:
 .. versionchanged:: 3.4.0
 Supports Spark Connect.
 
-.. versionchanged:: 3.5.0
-Added Level parameter.
-
 Parameters
 ----------
 level : int, optional, default None
 How many levels to print for nested schemas.
 
+.. versionchanged:: 3.5.0

Review Comment:
   I think the indentation is not right; you may refer to 
https://github.com/apache/spark/blob/7f724c3bc7567b0cddc09d5bed11b79879533368/python/pyspark/sql/dataframe.py#L1917-L1921
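
   For reference, a sketch of the layout the linked example uses, with the directive nested under the parameter description (the exact wording is illustrative):

   ```python
   from typing import Optional

   def printSchema(self, level: Optional[int] = None) -> None:
       """Prints out the schema in the tree format.

       Parameters
       ----------
       level : int, optional, default None
           How many levels to print for nested schemas.

           .. versionchanged:: 3.5.0
               Added ``level`` parameter.
       """
   ```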






[GitHub] [spark] bogao007 commented on a diff in pull request #40906: [SPARK-43134] [CONNECT] [SS] JVM client StreamingQuery exception() API

2023-04-24 Thread via GitHub


bogao007 commented on code in PR #40906:
URL: https://github.com/apache/spark/pull/40906#discussion_r1175907637


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala:
##
@@ -199,6 +205,26 @@ class RemoteStreamingQuery(
 // scalastyle:on println
   }
 
+  override def exception: Option[StreamingQueryException] = {
+val exception = executeQueryCmd(_.setException(true)).getException
+if (exception.hasExceptionMessage) {
+  // TODO(SPARK-43206): Add more information to StreamingQueryException.
+  Some(
+new StreamingQueryException(
+  "",
+  cause = null,
+  "",
+  "",
+  errorClass = "STREAM_FAILED",

Review Comment:
   Just curious, is the existing error framework accessible in Spark Connect 
jvm client?






[GitHub] [spark] HeartSaVioR commented on pull request #40905: [SPARK-43233] [SS] Add logging for Kafka Batch Reading for topic partition, offset range and task ID

2023-04-24 Thread via GitHub


HeartSaVioR commented on PR #40905:
URL: https://github.com/apache/spark/pull/40905#issuecomment-1520997483

   @siying Thanks for your first contribution to the Apache Spark project. Welcome!





[GitHub] [spark] HeartSaVioR closed pull request #40905: [SPARK-43233] [SS] Add logging for Kafka Batch Reading for topic partition, offset range and task ID

2023-04-24 Thread via GitHub


HeartSaVioR closed pull request #40905:  [SPARK-43233] [SS] Add logging for 
Kafka Batch Reading for topic partition, offset range and task ID
URL: https://github.com/apache/spark/pull/40905





[GitHub] [spark] ivoson commented on pull request #40610: [SPARK-42626][CONNECT] Add Destructive Iterator for SparkResult

2023-04-24 Thread via GitHub


ivoson commented on PR #40610:
URL: https://github.com/apache/spark/pull/40610#issuecomment-1520993028

   Hi @hvanhovell , could you please take a look at this PR? Thanks.





[GitHub] [spark] amaliujia commented on pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


amaliujia commented on PR #40931:
URL: https://github.com/apache/spark/pull/40931#issuecomment-1520990871

   OK, looks like we also have compatibility checks on spark-core:
   ```
   [error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.12:3.4.0! Found 4 potential problems (filtered 4036)
   [error]  * interface org.apache.spark.QueryContext does not have a correspondent in current version
   [error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.QueryContext")
   [error]  * class org.apache.spark.SparkException does not have a correspondent in current version
   [error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException")
   [error]  * object org.apache.spark.SparkException does not have a correspondent in current version
   [error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException$")
   [error]  * interface org.apache.spark.SparkThrowable does not have a correspondent in current version
   [error]    filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkThrowable")
   ```
   
   Are we ok to update the check?
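
   For reference, excludes like the ones MiMa suggests are normally collected in 
`project/MimaExcludes.scala`. A minimal sketch, assuming the usual 
Seq-of-filters layout (the val name and the version bucket it would go into are 
illustrative, not the actual change):
   
   ```scala
   import com.typesafe.tools.mima.core._
   
   // The filters MiMa itself suggests for the moved error-framework classes.
   val errorFrameworkMoveExcludes = Seq(
     ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.QueryContext"),
     ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException"),
     ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkException$"),
     ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SparkThrowable")
   )
   ```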





[GitHub] [spark] sunchao commented on pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message

2023-04-24 Thread via GitHub


sunchao commented on PR #40934:
URL: https://github.com/apache/spark/pull/40934#issuecomment-1520990636

   Merged to master, thanks!





[GitHub] [spark] sunchao closed pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message

2023-04-24 Thread via GitHub


sunchao closed pull request #40934: [SPARK-43268][SQL] Use proper error classes 
when exceptions are constructed with a message
URL: https://github.com/apache/spark/pull/40934





[GitHub] [spark] github-actions[bot] commented on pull request #38496: [SPARK-40708][SQL] Auto update table statistics based on write metrics

2023-04-24 Thread via GitHub


github-actions[bot] commented on PR #38496:
URL: https://github.com/apache/spark/pull/38496#issuecomment-1520984846

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





[GitHub] [spark] aokolnychyi commented on pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message

2023-04-24 Thread via GitHub


aokolnychyi commented on PR #40934:
URL: https://github.com/apache/spark/pull/40934#issuecomment-1520982569

   The error messages don't seem related.





[GitHub] [spark] amaliujia commented on pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


amaliujia commented on PR #40931:
URL: https://github.com/apache/spark/pull/40931#issuecomment-1520961265

   Trying to exclude the MIMA check for `common-utils`.





[GitHub] [spark] HeartSaVioR commented on pull request #40905: [SPARK-43233] [SS] Add logging for Kafka Batch Reading for topic partition, offset range and task ID

2023-04-24 Thread via GitHub


HeartSaVioR commented on PR #40905:
URL: https://github.com/apache/spark/pull/40905#issuecomment-1520959155

   Thanks! Merging to master.





[GitHub] [spark] sadikovi commented on pull request #40922: [SPARK-43063][SQL][FOLLOWUP] Add ToPrettyString expression for Dataset.show

2023-04-24 Thread via GitHub


sadikovi commented on PR #40922:
URL: https://github.com/apache/spark/pull/40922#issuecomment-1520954545

   I think the PR still introduces user-facing changes.
   Also, would it be possible to not make any changes in `Cast` and do everything 
in the `df.show` method?





[GitHub] [spark] sunchao commented on pull request #40847: [SPARK-43185][BUILD] Inline `hadoop-client` related properties in `pom.xml`

2023-04-24 Thread via GitHub


sunchao commented on PR #40847:
URL: https://github.com/apache/spark/pull/40847#issuecomment-1520953123

   Interesting, thanks for the detailed analysis @LuciferYang !
   
   > Use hadoop 3.2.x can't build hadoop-cloud module too
   
   This is Hadoop 3.2.2? I remember at some point we started to enable 
`hadoop-cloud` in Spark releases, so I wonder why this didn't cause any error 
back then..





[GitHub] [spark] BeishaoCao-db commented on a diff in pull request #40907: [PYTHON] Implement `__dir__()` in `pyspark.sql.dataframe.DataFrame` to include columns

2023-04-24 Thread via GitHub


BeishaoCao-db commented on code in PR #40907:
URL: https://github.com/apache/spark/pull/40907#discussion_r1175813851


##
python/pyspark/sql/dataframe.py:
##
@@ -3008,6 +3008,25 @@ def __getattr__(self, name: str) -> Column:
 jc = self._jdf.apply(name)
 return Column(jc)
 
+def __dir__(self):
+"""
+Examples
+

Review Comment:
   added
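
   For context, a minimal self-contained sketch of the kind of `__dir__` under 
review (the host class and the identifier filtering are assumptions for 
illustration, not necessarily the merged code):
   
   ```python
   from typing import List
   
   class DataFrame:  # hypothetical minimal host class for the sketch
       columns: List[str] = []
   
       def __dir__(self) -> List[str]:
           # Default attributes plus column names, so dir()-driven completion
           # engines can suggest columns on `df.<TAB>`.
           attrs = set(super().__dir__())
           attrs.update(c for c in self.columns if c.isidentifier())
           return sorted(attrs)
   ```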






[GitHub] [spark] BeishaoCao-db commented on pull request #40907: [PYTHON] Implement `__dir__()` in `pyspark.sql.dataframe.DataFrame` to include columns

2023-04-24 Thread via GitHub


BeishaoCao-db commented on PR #40907:
URL: https://github.com/apache/spark/pull/40907#issuecomment-1520825250

   > engine that uses `dir()` to generate autocomplete suggestions (e.g. 
IPython kernel, Databricks Notebooks) will suggest column names on the 
completion `df.|`
   
   Sure, created one ASF JIRA: https://issues.apache.org/jira/browse/SPARK-43270





[GitHub] [spark] WweiL opened a new pull request, #40935: [SPARK-43206] [SS] [CONNECT] [DRAFT] [DO-NOT-REVIEW] StreamingQuery exception() include stack trace

2023-04-24 Thread via GitHub


WweiL opened a new pull request, #40935:
URL: https://github.com/apache/spark/pull/40935

   
   
   ### What changes were proposed in this pull request?
   
   Add a stack trace to StreamingQuery's `exception()` method, following 
https://github.com/apache/spark/commit/a5c8a3c976889f33595ac18f82e73e6b9fd29b57#diff-98baf452f0352c75a39f39351c5f9e656675810b6d4cfd178f1b0bae9751495b
   
   Added to both the Python client and the Scala client.
   
   ### Why are the changes needed?
   
   Including the stack trace is helpful for debugging.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Manually tested:
   1. Python:
   ```
   JVM stacktrace:
   org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 1.0 (TID 1) (ip-10-110-19-234.us-west-2.compute.internal 
executor driver): org.apache.spark.api.python.PythonException: Traceback (most 
recent call last):
 File "/home/wei.liu/oss-spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 850, in main
   process()
   ```
   2. Scala: TODO





[GitHub] [spark] aokolnychyi commented on a diff in pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message

2023-04-24 Thread via GitHub


aokolnychyi commented on code in PR #40934:
URL: https://github.com/apache/spark/pull/40934#discussion_r1175744775


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala:
##
@@ -56,7 +56,10 @@ class NamespaceAlreadyExistsException private(
   }
 
   def this(message: String) = {
-this(message, errorClass = None, messageParameters = Map.empty[String, 
String])
+this(
+  message,
+  errorClass = Some("SCHEMA_ALREADY_EXISTS"),

Review Comment:
   It is a bit unfortunate that we have to use the same string literal in 
multiple places. Adding a constant would require a companion object and an 
import, as these values are accessed in constructors. I was not sure it would 
be worth it, so I kept it as is.
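
   For illustration, a hedged sketch of the companion-object alternative 
discussed here (names and placement are assumptions, not the merged code):
   
   ```scala
   object NamespaceAlreadyExistsException {
     // Hypothetical constant to avoid repeating the literal.
     private val ERROR_CLASS = "SCHEMA_ALREADY_EXISTS"
   }
   
   // The auxiliary constructor could then reference it:
   //   this(message, errorClass = Some(NamespaceAlreadyExistsException.ERROR_CLASS),
   //     messageParameters = Map.empty[String, String])
   ```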






[GitHub] [spark] aokolnychyi commented on pull request #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message

2023-04-24 Thread via GitHub


aokolnychyi commented on PR #40934:
URL: https://github.com/apache/spark/pull/40934#issuecomment-1520774365

   cc @gengliangwang @dongjoon-hyun @viirya @huaxingao @sunchao @cloud-fan 
@HyukjinKwon 





[GitHub] [spark] aokolnychyi opened a new pull request, #40934: [SPARK-43268][SQL] Use proper error classes when exceptions are constructed with a message

2023-04-24 Thread via GitHub


aokolnychyi opened a new pull request, #40934:
URL: https://github.com/apache/spark/pull/40934

   
   
   ### What changes were proposed in this pull request?
   
   
   This PR makes sure each exception affected in PR #40679 has a proper error 
class when constructed with an explicit message.
   
   ### Why are the changes needed?
   
   
   These changes are needed per [this 
discussion](https://github.com/apache/spark/pull/40679#discussion_r1159264585).
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No.
   
   ### How was this patch tested?
   
   
   Existing tests.
   





[GitHub] [spark] pjfanning commented on pull request #40933: [SPARK-43263][BUILD] Upgrade `FasterXML jackson` to 2.15.0

2023-04-24 Thread via GitHub


pjfanning commented on PR #40933:
URL: https://github.com/apache/spark/pull/40933#issuecomment-1520686760

   If you use jackson 2.14.2, you can just upgrade snakeyaml to 2.x.
   
   jackson 2.15.0 has extra changes that you need to worry about; see 
https://issues.apache.org/jira/browse/SPARK-42854





[GitHub] [spark] bjornjorgensen commented on pull request #40933: [SPARK-43263][BUILD] Upgrade `FasterXML jackson` to 2.15.0

2023-04-24 Thread via GitHub


bjornjorgensen commented on PR #40933:
URL: https://github.com/apache/spark/pull/40933#issuecomment-1520673919

   CC @pjfanning





[GitHub] [spark] bjornjorgensen opened a new pull request, #40933: [SPARK-43263][BUILD] Upgrade `FasterXML jackson` to 2.15.0

2023-04-24 Thread via GitHub


bjornjorgensen opened a new pull request, #40933:
URL: https://github.com/apache/spark/pull/40933

   ### What changes were proposed in this pull request?
   Upgrade FasterXML jackson from 2.14.2 to 2.15.0
   
   ### Why are the changes needed?
   Upgrade Snakeyaml to 2.0, which resolves 
[CVE-2022-1471](https://nvd.nist.gov/vuln/detail/CVE-2022-1471).
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GA





[GitHub] [spark] amaliujia commented on pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


amaliujia commented on PR #40931:
URL: https://github.com/apache/spark/pull/40931#issuecomment-1520666866

   This PR fails on 
   ```
   Error:  running /home/runner/work/spark/spark/dev/mima -Phadoop-3 
-Pdocker-integration-tests -Pyarn -Pkubernetes -Pspark-ganglia-lgpl -Pmesos 
-Pconnect -Phive -Phadoop-cloud -Pkinesis-asl -Phive-thriftserver -Pvolcano ; 
received return code 1
   ```
   
   but there are no error messages explaining how to fix it. I ran this locally, 
which output nothing as well.
   
   
   Stuck on the MiMa stuff now.





[GitHub] [spark] WweiL commented on pull request #40906: [SPARK-43134] [CONNECT] [SS] JVM client StreamingQuery exception() API

2023-04-24 Thread via GitHub


WweiL commented on PR #40906:
URL: https://github.com/apache/spark/pull/40906#issuecomment-1520665915

   The stack trace will be added under this ticket, 
https://issues.apache.org/jira/browse/SPARK-43206, after this PR is merged.





[GitHub] [spark] WweiL commented on pull request #40887: [SPARK-43144] Scala Client DataStreamReader table() API

2023-04-24 Thread via GitHub


WweiL commented on PR #40887:
URL: https://github.com/apache/spark/pull/40887#issuecomment-1520622334

   @amaliujia @HyukjinKwon 
   I also changed `ProtoToParsedPlanTestSuite` a little to remove the memory 
addresses; before the change, the streaming-table test alone would fail with:
   ```
   - streaming_table_API_with_options *** FAILED *** (8 milliseconds)
   [info]   Expected and actual plans do not match:
   [info]   
   [info]   === Expected Plan ===
   [info]   SubqueryAlias primary.tempdb.myStreamingTable
   [info]   +- StreamingRelationV2 primary.tempdb.myStreamingTable, 
org.apache.spark.sql.connector.catalog.InMemoryTable@752725d9, [p1=v1, p2=v2], 
[id#0L], org.apache.spark.sql.connector.catalog.InMemoryCatalog@347d8e2a, 
tempdb.myStreamingTable
   [info]   
   [info]   
   [info]   === Actual Plan ===
   [info]   SubqueryAlias primary.tempdb.myStreamingTable
   [info]   +- StreamingRelationV2 primary.tempdb.myStreamingTable, 
org.apache.spark.sql.connector.catalog.InMemoryTable@a88a5db, [p1=v1, p2=v2], 
[id#0L], org.apache.spark.sql.connector.catalog.InMemoryCatalog@2c6b362e, 
tempdb.myStreamingTable
   ```
   The memory address is different every time the test runs, so I removed these 
in the test suite.
   
   
   And verified that memory addresses don't appear in the existing explain files:
   ```
   wei.liu:~/oss-spark$ cat 
connector/connect/common/src/test/resources/query-tests/explain-results/* | 
grep @
   wei.liu:~/oss-spark$ 
   ```
   
   PTAL, thank you!
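
   A regex along these lines is the usual way to strip such addresses (a sketch, 
assuming rendered plans are compared as strings; the helper name is 
illustrative):
   
   ```scala
   // Drop JVM identity hashes such as "InMemoryTable@752725d9" before comparing.
   def removeMemoryAddresses(plan: String): String =
     plan.replaceAll("@[0-9a-f]+", "")
   ```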





[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40885:
URL: https://github.com/apache/spark/pull/40885#discussion_r1175623971


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
##
@@ -203,6 +203,21 @@ trait FileFormat {
* method. Technically, a file format could choose suppress them, but that 
is not recommended.
*/
   def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS
+
+  /**
+   * The extractors to use when deriving file-constant metadata columns for 
this file format.
+   *
+   * By default, the value of a file-constant metadata column is obtained by 
looking up the column's
+   * name in the file's metadata column value map. However, implementations 
can override this method
+   * in order to provide an extractor that has access to the entire 
[[PartitionedFile]] when
+   * deriving the column's value.
+   *
+   * NOTE: Extractors are lazy, invoked only if the query actually selects 
their column at runtime.
+   *
+   * See also [[FileFormat.getFileConstantMetadataColumnValue]].
+   */
+  def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] =

Review Comment:
   doc comments updated, PTAL
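
   For illustration, a hedged sketch of a `FileFormat` implementation opting in 
to the new hook (the column name is made up, and using 
`PartitionedFile.fileSize` for its value is an assumption):
   
   ```scala
   // Keep the default name-based extractors and add one that computes its
   // value directly from the PartitionedFile.
   override def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] =
     super.fileConstantMetadataExtractors ++
       Map("_custom_file_size" -> ((file: PartitionedFile) => file.fileSize))
   ```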






[GitHub] [spark] amaliujia commented on pull request #40899: [SPARK-43249][CONNECT] Fix missing stats for SQL Command

2023-04-24 Thread via GitHub


amaliujia commented on PR #40899:
URL: https://github.com/apache/spark/pull/40899#issuecomment-1520596794

   Thanks for adding the JIRA!





[GitHub] [spark] sunchao commented on pull request #40893: [SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution

2023-04-24 Thread via GitHub


sunchao commented on PR #40893:
URL: https://github.com/apache/spark/pull/40893#issuecomment-1520573329

   @pan3793 AFAIK the development efforts in the Hive community are only in Hive 
3.x/4.x at the moment, and the 2.x branch is barely maintained. I can try to 
start a conversation in the Hive community about a new 2.3.10 release and see 
how it goes.
   
   From a long-term perspective, it'd be better for Spark to move to Hive 
3.x/4.x.
   





[GitHub] [spark] sunchao commented on pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice in vectorized reader

2023-04-24 Thread via GitHub


sunchao commented on PR #39950:
URL: https://github.com/apache/spark/pull/39950#issuecomment-1520563772

   Yea, @yabola is correct: if we have 100 row groups in a file and there are 
100 tasks to read them, each task will only be assigned a range (e.g., a single 
row group) of the file to read, so it won't read metadata for all the row 
groups in the file.
   
   





[GitHub] [spark] dzhigimont commented on a diff in pull request #40665: [SPARK-42621][PS] Add inclusive parameter for pd.date_range

2023-04-24 Thread via GitHub


dzhigimont commented on code in PR #40665:
URL: https://github.com/apache/spark/pull/40665#discussion_r1175559912


##
python/pyspark/pandas/namespace.py:
##
@@ -1782,12 +1780,8 @@ def date_range(
 Normalize start/end dates to midnight before generating date range.
 name : str, default None
 Name of the resulting DatetimeIndex.
-closed : {None, 'left', 'right'}, optional
-Make the interval closed with respect to the given frequency to
-the 'left', 'right', or both sides (None, the default).
-
-.. deprecated:: 3.4.0
-
+inclusive : {"both", "neither", "left", "right"}, default "both"

Review Comment:
   Sure, versionadded has been added
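
   A hedged usage sketch of the new parameter (dates and output are 
illustrative):
   
   ```python
   import pyspark.pandas as ps
   
   # With inclusive="left" the right boundary is dropped, mirroring pandas.
   ps.date_range(start="2023-01-01", end="2023-01-04", inclusive="left")
   # -> DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03'], ...)
   ```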






[GitHub] [spark] DerekTBrown closed pull request #40798: SPARK-43166: name docker users

2023-04-24 Thread via GitHub


DerekTBrown closed pull request #40798: SPARK-43166: name docker users
URL: https://github.com/apache/spark/pull/40798





[GitHub] [spark] DerekTBrown commented on pull request #40798: SPARK-43166: name docker users

2023-04-24 Thread via GitHub


DerekTBrown commented on PR #40798:
URL: https://github.com/apache/spark/pull/40798#issuecomment-1520512047

   Looks good. Closing in favor of #40831 





[GitHub] [spark] dzhigimont commented on a diff in pull request #40370: [SPARK-42620][PS] Add `inclusive` parameter for (DataFrame|Series).between_time

2023-04-24 Thread via GitHub


dzhigimont commented on code in PR #40370:
URL: https://github.com/apache/spark/pull/40370#discussion_r1175552369


##
python/pyspark/pandas/frame.py:
##
@@ -3582,14 +3571,18 @@ def between_time(
 if not isinstance(self.index, ps.DatetimeIndex):
 raise TypeError("Index must be DatetimeIndex")
 
+if inclusive not in ["left", "right", "both", "neither"]:
+msg = "Inclusive has to be either 'both', 'neither', 'left' or 
'right'"
+raise ValueError(msg)

Review Comment:
   Changed to PySparkValueError and the new error_class has been added
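
   For reference, a hedged sketch of what the reworked validation could look 
like (the error class name and message parameters are assumptions, not the 
merged code):
   
   ```python
   from pyspark.errors import PySparkValueError
   
   def _check_inclusive(inclusive: str) -> None:
       # Reject anything outside the four supported boundary modes.
       if inclusive not in ("left", "right", "both", "neither"):
           raise PySparkValueError(
               error_class="VALUE_NOT_ALLOWED",
               message_parameters={
                   "arg_name": "inclusive",
                   "allowed_values": "['left', 'right', 'both', 'neither']",
               },
           )
   ```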






[GitHub] [spark] dzhigimont commented on a diff in pull request #40370: [SPARK-42620][PS] Add `inclusive` parameter for (DataFrame|Series).between_time

2023-04-24 Thread via GitHub


dzhigimont commented on code in PR #40370:
URL: https://github.com/apache/spark/pull/40370#discussion_r1175551612


##
python/pyspark/pandas/frame.py:
##
@@ -3519,16 +3516,8 @@ def between_time(
 Initial time as a time filter limit.
 end_time : datetime.time or str
 End time as a time filter limit.
-include_start : bool, default True
-Whether the start time needs to be included in the result.
-
-.. deprecated:: 3.4.0
-
-include_end : bool, default True
-Whether the end time needs to be included in the result.
-
-.. deprecated:: 3.4.0
-
+inclusive : {"both", "neither", "left", "right"}, default "both"
+Include boundaries; whether to set each bound as closed or open.

Review Comment:
   Added



##
python/pyspark/pandas/series.py:
##
@@ -6836,16 +6833,8 @@ def between_time(
 Initial time as a time filter limit.
 end_time : datetime.time or str
 End time as a time filter limit.
-include_start : bool, default True
-Whether the start time needs to be included in the result.
-
-.. deprecated:: 3.4.0
-
-include_end : bool, default True
-Whether the end time needs to be included in the result.
-
-.. deprecated:: 3.4.0
-
+inclusive : {"both", "neither", "left", "right"}, default "both"
+Include boundaries; whether to set each bound as closed or open.

Review Comment:
   Added
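
   A hedged usage sketch of the documented parameter (the data and window are 
illustrative):
   
   ```python
   import pandas as pd
   import pyspark.pandas as ps
   
   idx = pd.date_range("2018-04-09", periods=4, freq="1D20min")
   psdf = ps.DataFrame({"A": [1, 2, 3, 4]}, index=idx)
   # inclusive="left" keeps rows at 0:15 but drops rows falling exactly on 0:45.
   psdf.between_time("0:15", "0:45", inclusive="left")
   ```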






[GitHub] [spark] RyanBerti commented on pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch

2023-04-24 Thread via GitHub


RyanBerti commented on PR #40615:
URL: https://github.com/apache/spark/pull/40615#issuecomment-1520502797

    > about adding a third boolean argument, with the default value being false





[GitHub] [spark] RyanBerti commented on a diff in pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch

2023-04-24 Thread via GitHub


RyanBerti commented on code in PR #40615:
URL: https://github.com/apache/spark/pull/40615#discussion_r1175544253


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/DatasketchesHllSketchSuite.scala:
##
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import scala.collection.immutable.NumericRange
+import scala.util.Random
+
+import org.apache.datasketches.hll.HllSketch
+import org.apache.datasketches.memory.Memory
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{BoundReference, 
HllSketchEstimate}
+import org.apache.spark.sql.types.{BinaryType, DataType, IntegerType, 
LongType, StringType}
+import org.apache.spark.unsafe.types.UTF8String
+
+
+class DatasketchesHllSketchSuite extends SparkFunSuite {

Review Comment:
   Ok, I'll go ahead and remove tgtHllType as a parameter. I was of the mindset 
that the Spark API should match the Datasketches HllSketch API, but we can slim 
down the params supported by Spark for ease of use.
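
   For illustration, the slimmed-down call shape would look like this (a sketch; 
the function names follow the usage string above, and `spark` is an assumed 
SparkSession):
   
   ```scala
   spark.sql(
     """SELECT hll_sketch_estimate(hll_sketch_agg(col1, 12))
       |FROM VALUES (1), (1), (2), (2), (3) tab(col1)""".stripMargin
   ).show()
   ```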






[GitHub] [spark] aokolnychyi commented on pull request #40919: [SPARK-43204][SQL] Align MERGE assignments with table attributes

2023-04-24 Thread via GitHub


aokolnychyi commented on PR #40919:
URL: https://github.com/apache/spark/pull/40919#issuecomment-1520498405

   @cloud-fan @sunchao @viirya @huaxingao @dongjoon-hyun @gengliangwang, this 
is a follow-up to PR #40308.





[GitHub] [spark] RyanBerti commented on a diff in pull request #40615: [SPARK-16484][SQL] Add support for Datasketches HllSketch

2023-04-24 Thread via GitHub


RyanBerti commented on code in PR #40615:
URL: https://github.com/apache/spark/pull/40615#discussion_r1175541542


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala:
##
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+
+import java.util.Locale
+
+import org.apache.datasketches.SketchesArgumentException
+import org.apache.datasketches.hll.{HllSketch, TgtHllType, Union}
+import org.apache.datasketches.memory.Memory
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, 
Expression, ExpressionDescription, Literal}
+import org.apache.spark.sql.catalyst.trees.{BinaryLike, TernaryLike}
+import org.apache.spark.sql.types.{AbstractDataType, AnyDataType, BinaryType, 
DataType, IntegerType, LongType, StringType}
+import org.apache.spark.unsafe.types.UTF8String
+
+
+/**
+ * The HllSketchAgg function utilizes a Datasketches HllSketch instance to
+ * probabilistically count the number of unique values in a given column, and
+ * outputs the binary representation of the HllSketch.
+ *
+ * See [[https://datasketches.apache.org/docs/HLL/HLL.html]] for more 
information
+ *
+ * @param child child expression against which unique counting will occur
+ * @param lgConfigK the log-base-2 of K, where K is the number of buckets or 
slots for the sketch
+ * @param tgtHllType the target type of the HllSketch to be used (HLL_4, 
HLL_6, HLL_8)
+ */
+@ExpressionDescription(
+  usage = """
+_FUNC_(expr, lgConfigK, tgtHllType) - Returns the HllSketch's compact 
binary representation.
+  `lgConfigK` (optional) the log-base-2 of K, with K = the number of 
buckets for the HllSketch.
+  `tgtHllType` (optional) the target type of the HllSketch (HLL_4, HLL_6, 
HLL_8). """,
+  examples = """
+Examples:
+  > SELECT hll_sketch_estimate(_FUNC_(col1, 12, 'HLL_4'))
+  FROM VALUES (1), (1), (2), (2), (3) tab(col1);
+   3
+  """,
+  group = "agg_funcs",
+  since = "3.5.0")
+case class HllSketchAgg(
+child: Expression,
+lgConfigKExpression: Expression,
+tgtHllTypeExpression: Expression,
+mutableAggBufferOffset: Int = 0,
+inputAggBufferOffset: Int = 0)
+  extends TypedImperativeAggregate[HllSketch] with TernaryLike[Expression] 
with ExpectsInputTypes {
+
+  // Hllsketch config - mark as lazy so that they're not evaluated during tree 
transformation.
+
+  lazy val lgConfigK: Int = second.eval().asInstanceOf[Int]
+  lazy val tgtHllType: TgtHllType = try {
+
TgtHllType.valueOf(third.eval().asInstanceOf[UTF8String].toString.toUpperCase(Locale.ROOT))
+  } catch {
+case _: IllegalArgumentException =>
+  throw new SketchesArgumentException("Invalid tgtHllType value")
+  }
+
+  // Constructors
+
+  def this(child: Expression) = {
+this(child, Literal(HllSketch.DEFAULT_LG_K), 
Literal(HllSketch.DEFAULT_HLL_TYPE.toString), 0, 0)
+  }
+
+  def this(child: Expression, lgConfigK: Expression) = {
+this(child, lgConfigK, Literal(HllSketch.DEFAULT_HLL_TYPE.toString), 0, 0)
+  }
+
+  def this(child: Expression, lgConfigK: Int) = {
+this(child, Literal(lgConfigK), 
Literal(HllSketch.DEFAULT_HLL_TYPE.toString), 0, 0)
+  }
+
+  def this(child: Expression, lgConfigK: Expression, tgtHllType: Expression) = 
{
+this(child, lgConfigK, tgtHllType, 0, 0)
+  }
+
+  def this(child: Expression, lgConfigK: Int, tgtHllType: String) = {
+this(child, Literal(lgConfigK), Literal(tgtHllType), 0, 0)
+  }
+
+  // Copy constructors required by ImperativeAggregate
+
+  override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: Int): 
HllSketchAgg =
+copy(mutableAggBufferOffset = newMutableAggBufferOffset)
+
+  override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): 
HllSketchAgg =
+copy(inputAggBufferOffset = newInputAggBufferOffset)
+
+  override protected def withNewChildrenInternal(newFirst: Expression,
+  newSecond: Expression,
+  newThird: Expression): 
HllSketchAgg =
+copy(child = newFirst, 

[GitHub] [spark] amaliujia commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


amaliujia commented on code in PR #40931:
URL: https://github.com/apache/spark/pull/40931#discussion_r1175520012


##
common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala:
##
@@ -34,7 +33,7 @@ private[spark] object ErrorMessageFormat extends Enumeration {
  */
 private[spark] object SparkThrowableHelper {
   val errorReader = new ErrorClassesJsonReader(
-Seq(Utils.getSparkClassLoader.getResource("error/error-classes.json")))
+Seq(getClass.getClassLoader.getResource("error/error-classes.json")))

Review Comment:
   Yes, `Utils` is a bit big and I need to tackle that separately with a strategy.






[GitHub] [spark] pan3793 commented on pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub


pan3793 commented on PR #40920:
URL: https://github.com/apache/spark/pull/40920#issuecomment-1520446235

   cc @sunchao 





[GitHub] [spark] hvanhovell commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


hvanhovell commented on code in PR #40931:
URL: https://github.com/apache/spark/pull/40931#discussion_r1175497831


##
common/utils/src/main/scala/org/apache/spark/ErrorClassesJSONReader.scala:
##
@@ -30,6 +29,7 @@ import org.apache.commons.text.StringSubstitutor
 
 import org.apache.spark.annotation.DeveloperApi
 
+

Review Comment:
   Remove line






[GitHub] [spark] hvanhovell commented on a diff in pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


hvanhovell commented on code in PR #40931:
URL: https://github.com/apache/spark/pull/40931#discussion_r1175496791


##
common/utils/src/main/scala/org/apache/spark/SparkThrowableHelper.scala:
##
@@ -34,7 +33,7 @@ private[spark] object ErrorMessageFormat extends Enumeration {
  */
 private[spark] object SparkThrowableHelper {
   val errorReader = new ErrorClassesJsonReader(
-Seq(Utils.getSparkClassLoader.getResource("error/error-classes.json")))
+Seq(getClass.getClassLoader.getResource("error/error-classes.json")))

Review Comment:
   We should move part of `Utils` as well; let's do that in a follow-up.






[GitHub] [spark] peter-toth commented on pull request #37630: [SPARK-40193][SQL] Merge subquery plans with different filters

2023-04-24 Thread via GitHub


peter-toth commented on PR #37630:
URL: https://github.com/apache/spark/pull/37630#issuecomment-1520435320

   I extracted the first commit of this PR, which just moves 
`MergeScalarSubqueries` from `spark-catalyst` to `spark-sql`, into 
https://github.com/apache/spark/pull/40932 to make the actual change of this PR 
clearer once that PR is merged.





[GitHub] [spark] peter-toth opened a new pull request, #40932: [SPARK-43266][SQL] Move MergeScalarSubqueries to spark-sql

2023-04-24 Thread via GitHub


peter-toth opened a new pull request, #40932:
URL: https://github.com/apache/spark/pull/40932

   ### What changes were proposed in this pull request?
   This PR moves `MergeScalarSubqueries` from `catalyst` to `spark-sql`
   
   ### Why are the changes needed?
   Makes SPARK-40193 / https://github.com/apache/spark/pull/37630 easier to review.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Existing UTs.
   





[GitHub] [spark] cloud-fan closed pull request #40879: [SPARK-43217] Correctly recurse in nested maps/arrays in findNestedField

2023-04-24 Thread via GitHub


cloud-fan closed pull request #40879: [SPARK-43217] Correctly recurse in nested 
maps/arrays in findNestedField
URL: https://github.com/apache/spark/pull/40879





[GitHub] [spark] cloud-fan commented on pull request #40879: [SPARK-43217] Correctly recurse in nested maps/arrays in findNestedField

2023-04-24 Thread via GitHub


cloud-fan commented on PR #40879:
URL: https://github.com/apache/spark/pull/40879#issuecomment-1520408175

   thanks, merging to master!





[GitHub] [spark] amaliujia commented on pull request #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


amaliujia commented on PR #40931:
URL: https://github.com/apache/spark/pull/40931#issuecomment-1520403075

   @cloud-fan @hvanhovell 





[GitHub] [spark] amaliujia opened a new pull request, #40931: [SPARK-43265] Move Error framework to a common utils module

2023-04-24 Thread via GitHub


amaliujia opened a new pull request, #40931:
URL: https://github.com/apache/spark/pull/40931

   
   
   ### What changes were proposed in this pull request?
   
   Move the error framework to a common utils module so that we can share it 
between Spark and Spark Connect without introducing heavy dependencies on the 
Spark Connect module.
   
   ### Why are the changes needed?
   
   Reduce the dependencies pulled in by Spark Connect.
   
   ### Does this PR introduce _any_ user-facing change?
   
   The error framework is an internal API, so this should be fine.
   
   ### How was this patch tested?
   
   Existing UT.





[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40885:
URL: https://github.com/apache/spark/pull/40885#discussion_r1175452917


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
##
@@ -203,6 +203,21 @@ trait FileFormat {
* method. Technically, a file format could choose suppress them, but that 
is not recommended.
*/
   def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS
+
+  /**
+   * The extractors to use when deriving file-constant metadata columns for 
this file format.
+   *
+   * By default, the value of a file-constant metadata column is obtained by 
looking up the column's
+   * name in the file's metadata column value map. However, implementations 
can override this method
+   * in order to provide an extractor that has access to the entire 
[[PartitionedFile]] when
+   * deriving the column's value.
+   *
+   * NOTE: Extractors are lazy, invoked only if the query actually selects 
their column at runtime.
+   *
+   * See also [[FileFormat.getFileConstantMetadataColumnValue]].
+   */
+  def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] =

Review Comment:
   > Can we add more comments in the new API, saying that it may only handle 
part of the constant metadata columns? The rest of them should be handled by a 
custom FileIndex.
   
   There was already one comment to that effect in the PR that added custom 
metadata column support:
   
https://github.com/apache/spark/pull/40677/files#diff-2d7ec6604810e1be58ca7852ec389853c143b684894f2390ffe07f96701d2219R29
   
   And the unit tests are also a pretty clear demonstration. 
   
   Good point that the comments are buried pretty deep tho -- I'll try to 
add/expand them in FileFormat.scala as well.






[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40885:
URL: https://github.com/apache/spark/pull/40885#discussion_r1175449282


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
##
@@ -203,6 +203,21 @@ trait FileFormat {
* method. Technically, a file format could choose suppress them, but that 
is not recommended.
*/
   def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS
+
+  /**
+   * The extractors to use when deriving file-constant metadata columns for 
this file format.
+   *
+   * By default, the value of a file-constant metadata column is obtained by 
looking up the column's
+   * name in the file's metadata column value map. However, implementations 
can override this method
+   * in order to provide an extractor that has access to the entire 
[[PartitionedFile]] when
+   * deriving the column's value.
+   *
+   * NOTE: Extractors are lazy, invoked only if the query actually selects 
their column at runtime.
+   *
+   * See also [[FileFormat.getFileConstantMetadataColumnValue]].
+   */
+  def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] =

Review Comment:
   > Basically, we get the value of a constant metadata column from two maps: 
one map is filled by FileIndex and contains the value directly. The other map 
is filled by the FileFormat implementation and contains extractors.
   
   Nit: The extractors don't "contain" the value at all. They just describe how 
to extract a value from the `PartitionedFile` that the `FileIndex` provides. I 
don't know a cleaner way to do this, given that the extraction policy should 
probably be consistent across all files rather than re-defined by each file. 
(Prior to this PR, the "policy" was hard-wired as "look it up in the map"; now 
it is also allowed to be "do arbitrary computation on the `PartitionedFile`", 
if the user prefers.)
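
   For illustration, a hedged sketch of the two-map lookup described above 
(names are illustrative, not the actual internals):
   
   ```scala
   def constantMetadataValue(
       name: String,
       file: PartitionedFile,
       extractors: Map[String, PartitionedFile => Any],
       valuesFromIndex: Map[String, Any]): Any =
     extractors.get(name) match {
       case Some(extract) => extract(file)                // FileFormat-provided policy
       case None => valuesFromIndex.getOrElse(name, null) // FileIndex-provided value
     }
   ```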






[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub


pan3793 commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175444671


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##
@@ -789,22 +789,22 @@ case class RepairTableCommand(
 if (partitionSpecsAndLocs.length > threshold) {
   val hadoopConf = spark.sessionState.newHadoopConf()
   val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-  val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+  val locations = partitionSpecsAndLocs.map(_._2).toArray

Review Comment:
   done






[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub


pan3793 commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175443729


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##
@@ -789,22 +789,22 @@ case class RepairTableCommand(
 if (partitionSpecsAndLocs.length > threshold) {
   val hadoopConf = spark.sessionState.newHadoopConf()
   val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-  val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+  val locations = partitionSpecsAndLocs.map(_._2).toArray
 
   // Set the number of parallelism to prevent following file listing from generating many tasks
   // in case of large #defaultParallelism.
-  val numParallelism = Math.min(serializedPaths.length,
+  val numParallelism = Math.min(locations.length,

Review Comment:
   Yes, changed to evaluate once as `partitionNum`
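   
   For illustration, a minimal sketch of that change, assuming the surrounding `RepairTableCommand` scope (a SparkSession `spark` and the `locations` sequence in scope):
   ```scala
// Evaluate the collection size once as `partitionNum` and reuse it,
// instead of re-computing `locations.length` at each use site.
val partitionNum = locations.length
val numParallelism = Math.min(partitionNum,
  Math.min(spark.sparkContext.defaultParallelism, 10000))
   ```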






[GitHub] [spark] pan3793 commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub


pan3793 commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175442719


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##
@@ -789,22 +789,22 @@ case class RepairTableCommand(
 if (partitionSpecsAndLocs.length > threshold) {
   val hadoopConf = spark.sessionState.newHadoopConf()
   val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-  val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+  val locations = partitionSpecsAndLocs.map(_._2)

   // Set the number of parallelism to prevent following file listing from generating many tasks
   // in case of large #defaultParallelism.
-  val numParallelism = Math.min(serializedPaths.length,
+  val numParallelism = Math.min(locations.length,
     Math.min(spark.sparkContext.defaultParallelism, 10000))
   // gather the fast stats for all the partitions otherwise Hive metastore will list all the
   // files for all the new partitions in sequential way, which is super slow.
   logInfo(s"Gather the fast stats in parallel using $numParallelism tasks.")
-  spark.sparkContext.parallelize(serializedPaths, numParallelism)
-    .mapPartitions { paths =>
+  spark.sparkContext.parallelize(locations, numParallelism)
+    .mapPartitions { locationsEachPartition =>
       val pathFilter = getPathFilter(serializableConfiguration.value)
-      paths.map(new Path(_)).map{ path =>
-        val fs = path.getFileSystem(serializableConfiguration.value)
-        val statuses = fs.listStatus(path, pathFilter)
-        (path.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum))
+      locationsEachPartition.map { location =>
+        val fs = location.getFileSystem(serializableConfiguration.value)
+        val statuses = fs.listStatus(location, pathFilter)
+        (location.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum))

Review Comment:
   Looks simpler, adopted.






[GitHub] [spark] majdyz commented on pull request #40929: [SPARK-43264][SQL] Avoid allocation of unwritten ColumnVector in Spark Vectorized Reader

2023-04-24 Thread via GitHub


majdyz commented on PR #40929:
URL: https://github.com/apache/spark/pull/40929#issuecomment-1520371824

   @LuciferYang Thanks, I think it's already been enabled now





[GitHub] [spark] majdyz closed pull request #40929: [SPARK-43264][SQL] Avoid allocation of unwritten ColumnVector in Spark Vectorized Reader

2023-04-24 Thread via GitHub


majdyz closed pull request #40929: [SPARK-43264][SQL] Avoid allocation of 
unwritten ColumnVector in Spark Vectorized Reader
URL: https://github.com/apache/spark/pull/40929





[GitHub] [spark] ryan-johnson-databricks commented on a diff in pull request #40885: [SPARK-43226] Define extractors for file-constant metadata

2023-04-24 Thread via GitHub


ryan-johnson-databricks commented on code in PR #40885:
URL: https://github.com/apache/spark/pull/40885#discussion_r1175437133


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala:
##
@@ -203,6 +203,21 @@ trait FileFormat {
 * method. Technically, a file format could choose to suppress them, but that is not recommended.
 */
   def metadataSchemaFields: Seq[StructField] = FileFormat.BASE_METADATA_FIELDS
+
+  /**
+   * The extractors to use when deriving file-constant metadata columns for this file format.
+   *
+   * By default, the value of a file-constant metadata column is obtained by looking up the column's
+   * name in the file's metadata column value map. However, implementations can override this method
+   * in order to provide an extractor that has access to the entire [[PartitionedFile]] when
+   * deriving the column's value.
+   *
+   * NOTE: Extractors are lazy, invoked only if the query actually selects their column at runtime.
+   *
+   * See also [[FileFormat.getFileConstantMetadataColumnValue]].
+   */
+  def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] =

Review Comment:
   I made a dummy PR https://github.com/apache/spark/pull/40930 to show what 
the other approach actually looks like. I still think it is uglier code for 
everyone involved, with no benefit to safety, but at least this way we have 
something concrete to compare.
   
   The change for the other PR vs master is:
   ```
   sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala                               |  33 +---
   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala                           | 145 --
   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala                          |  70 +++
   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala           |  10 -
   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala            |   4 ++--
   sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCustomMetadataStructSuite.scala  |  71 
   6 files changed, 196 insertions(+), 137 deletions(-)
   ```
   
   The change of this PR vs. master touches fewer files and is ~50LoC smaller:
   ```
   sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala                               |  33 +++--
   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala                           | 102 --
   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala                          |  70 +-
   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala           |  10 +-
   sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCustomMetadataStructSuite.scala  |  35 ++-
   5 files changed, 143 insertions(+), 107 deletions(-)
   ```
   
   The other PR vs. this one changes 3 files and ~100LoC:
   ```
   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala                           | 99 ---
   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala            |  4 ++--
   sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceCustomMetadataStructSuite.scala  | 70 --
   3 files changed, 98 insertions(+), 75 deletions(-)
   ```
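   
   For context, a hypothetical implementation-side override under this PR's approach. The `file_name` column and its lambda are made up for illustration, and `fileConstantMetadataExtractors` is the method proposed in this PR, so its exact default may differ:
   ```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.execution.datasources.{FileFormat, PartitionedFile}

// Hypothetical format exposing a lazily-derived "file_name" constant metadata
// column on top of the default map-based columns.
trait MyFileFormat extends FileFormat {
  override def fileConstantMetadataExtractors: Map[String, PartitionedFile => Any] =
    super.fileConstantMetadataExtractors +
      ("file_name" -> ((pf: PartitionedFile) => new Path(pf.filePath.toString).getName))
}
   ```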






[GitHub] [spark] LuciferYang commented on pull request #40847: [SPARK-43185][BUILD] Inline `hadoop-client` related properties in `pom.xml`

2023-04-24 Thread via GitHub


LuciferYang commented on PR #40847:
URL: https://github.com/apache/spark/pull/40847#issuecomment-1520363712

   More findings:
   1. The conclusion is the same with hadoop 3.0.x and hadoop 3.1.x.
   2. Using hadoop 3.2.x also fails to build the `hadoop-cloud` module.
   3. Currently, only hadoop 3.3.x can build all modules.





[GitHub] [spark] ryan-johnson-databricks opened a new pull request, #40930: [DO NOT MERGE] File constant metadata extractors split

2023-04-24 Thread via GitHub


ryan-johnson-databricks opened a new pull request, #40930:
URL: https://github.com/apache/spark/pull/40930

   
   
   ### What changes were proposed in this pull request?
   
   Experimental PR in response to 
https://github.com/apache/spark/pull/40885#discussion_r1174277575, so that 
reviewers can visualize the difference between the two approaches discussed 
there. NOT FOR MERGE -- If this approach is chosen, I will update the other PR 
accordingly. 
   
   Only the last commit differs from the other PR.
   
   ### Why are the changes needed?
   
   N/A
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   N/A





[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub


LuciferYang commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175425968


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##
@@ -789,22 +789,22 @@ case class RepairTableCommand(
 if (partitionSpecsAndLocs.length > threshold) {
   val hadoopConf = spark.sessionState.newHadoopConf()
   val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-  val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+  val locations = partitionSpecsAndLocs.map(_._2)

   // Set the number of parallelism to prevent following file listing from generating many tasks
   // in case of large #defaultParallelism.
-  val numParallelism = Math.min(serializedPaths.length,
+  val numParallelism = Math.min(locations.length,
     Math.min(spark.sparkContext.defaultParallelism, 10000))
   // gather the fast stats for all the partitions otherwise Hive metastore will list all the
   // files for all the new partitions in sequential way, which is super slow.
   logInfo(s"Gather the fast stats in parallel using $numParallelism tasks.")
-  spark.sparkContext.parallelize(serializedPaths, numParallelism)
-    .mapPartitions { paths =>
+  spark.sparkContext.parallelize(locations, numParallelism)
+    .mapPartitions { locationsEachPartition =>
       val pathFilter = getPathFilter(serializableConfiguration.value)
-      paths.map(new Path(_)).map{ path =>
-        val fs = path.getFileSystem(serializableConfiguration.value)
-        val statuses = fs.listStatus(path, pathFilter)
-        (path.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum))
+      locationsEachPartition.map { location =>
+        val fs = location.getFileSystem(serializableConfiguration.value)
+        val statuses = fs.listStatus(location, pathFilter)
+        (location.toString, PartitionStatistics(statuses.length, statuses.map(_.getLen).sum))

Review Comment:
   How about changing the return type to `Map[Path, PartitionStatistics]`? Not
   sure if it would be better.
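   
   A rough sketch of that suggested shape, for illustration only: `PartitionStatistics` is approximated with a stand-in case class here, and a single `FileSystem` is assumed for all locations:
   ```scala
import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}

// Stand-in for Spark's PartitionStatistics, for illustration only.
case class PartitionStatistics(fileCount: Int, totalSize: Long)

// Collect statistics keyed by Path, avoiding the Path -> String round-trip.
def gatherStats(
    fs: FileSystem,
    locations: Seq[Path],
    filter: PathFilter): Map[Path, PartitionStatistics] =
  locations.map { location =>
    val statuses = fs.listStatus(location, filter)
    location -> PartitionStatistics(statuses.length, statuses.map(_.getLen).sum)
  }.toMap
   ```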
   
   






[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub


LuciferYang commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175422370


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##
@@ -789,22 +789,22 @@ case class RepairTableCommand(
 if (partitionSpecsAndLocs.length > threshold) {
   val hadoopConf = spark.sessionState.newHadoopConf()
   val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-  val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+  val locations = partitionSpecsAndLocs.map(_._2).toArray
 
   // Set the number of parallelism to prevent following file listing from generating many tasks
   // in case of large #defaultParallelism.
-  val numParallelism = Math.min(serializedPaths.length,
+  val numParallelism = Math.min(locations.length,

Review Comment:
   Are `locations.length` and `partitionSpecsAndLocs.length` equal?
   
   






[GitHub] [spark] LuciferYang commented on a diff in pull request #40920: [SPARK-43248][SQL] Unnecessary serialize/deserialize of Path on parallel gather partition stats

2023-04-24 Thread via GitHub


LuciferYang commented on code in PR #40920:
URL: https://github.com/apache/spark/pull/40920#discussion_r1175420930


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##
@@ -789,22 +789,22 @@ case class RepairTableCommand(
 if (partitionSpecsAndLocs.length > threshold) {
   val hadoopConf = spark.sessionState.newHadoopConf()
   val serializableConfiguration = new SerializableConfiguration(hadoopConf)
-  val serializedPaths = partitionSpecsAndLocs.map(_._2.toString).toArray
+  val locations = partitionSpecsAndLocs.map(_._2).toArray

Review Comment:
   Can we remove `toArray`?
   
   






  1   2   >